On approximating weighted sums with exponentially many terms
http://www.elsevier.com/locate/jcss
Journal of Computer and System Sciences 69 (2004) 196–234
Deepak Chawla,1 Lin Li, and Stephen Scott*
Department of Computer Science, University of Nebraska, Lincoln, NE 68588-0115, USA
Received 28 March 2003; revised 8 January 2004
Abstract
Multiplicative weight-update algorithms such as Winnow and Weighted Majority have been studied extensively due to their on-line mistake bounds' logarithmic dependence on N, the total number of inputs, which allows them to be applied to problems where N is exponential. However, a large N requires techniques to efficiently compute the weighted sums of inputs to these algorithms. In special cases the weighted sum can be exactly computed efficiently, but for numerous problems such an approach seems infeasible. Thus we explore applications of Markov chain Monte Carlo (MCMC) methods to estimate the total weight. Our methods are very general and applicable to any representation of a learning problem for which the inputs to a linear learning algorithm can be represented as states in a completely connected, untruncated Markov chain. We give theoretical worst-case guarantees on our technique and then apply it to two problems: learning DNF formulas using Winnow, and pruning classifier ensembles using Weighted Majority. We then present empirical results on simulated data indicating that in practice, the time complexity is much better than what is implied by our worst-case theoretical analysis.
© 2003 Elsevier Inc. All rights reserved.
Keywords: Markov chain Monte Carlo approximation; Winnow; Weighted Majority; Multiplicative weight updates;
Perceptron; DNF learning; Boosting
1. Introduction
Multiplicative weight-update algorithms (e.g. [6,21,24]) have been studied extensively due to their on-line mistake bounds' logarithmic dependence on N, the total number of inputs. (These
A preliminary version [7] of this paper appeared in COLT 2001. *Corresponding author.
E-mail address: [email protected] (S. Scott).
URL: http://www.cse.unl.edu/~sscott. 1 Now at EMC Corporation, Raleigh-Durham, NC.
0022-0000/$ - see front matter © 2003 Elsevier Inc. All rights reserved.
doi:10.1016/j.jcss.2004.01.006
bounds can be translated into PAC sample complexity bounds via a simple procedure [22].) This attribute efficiency allows them to be applied to problems where N is exponential in the input size, which is the case in many applications, including using Winnow [21] to learn DNF formulas in unrestricted domains and using the Weighted Majority algorithm (WM [24]) to predict nearly as well as the best pruning of a classifier ensemble (from e.g. boosting). However, a large N requires techniques to efficiently compute the weighted sums of inputs to WM and Winnow. One method of doing this is to exploit commonalities among the inputs, partitioning them into a polynomial number of groups such that given a single member of each group, the total weight contribution of that group can be efficiently computed [13–15,25,30,39]. But many WM and Winnow applications do not appear to exhibit such structure, so it seems that a brute-force implementation is the only option to guarantee complete correctness.2 Thus we explore applications of Markov chain Monte Carlo (MCMC) methods to estimate the total weight without the need for special structure in the problem. Our methods are very general and applicable to any representation of a learning problem for which the inputs to the linear learning algorithm can be represented as states in a completely connected, untruncated3 Markov chain. In this paper we apply our results to two such problems, described below.
First we study learning DNF formulas (e.g. [5]) using Winnow4 [21] and not using membership
queries. We enumerate all possible DNF terms and use Winnow to learn a monotone disjunction over these terms, which it can do while making O(K log N) prediction mistakes, where K is the number of relevant terms and N is the total number of terms. So a brute-force implementation of Winnow makes a polynomial number of errors on arbitrary examples (i.e. with no distributional assumptions) and does not require membership queries. However, a brute-force implementation requires exponential time to compute the weighted sum of the inputs. So we apply our MCMC-based results to estimate this sum.
Next we investigate pruning a classifier ensemble (from e.g. boosting), which can reduce
overfitting and time for evaluation [26,40]. We use the Weighted Majority algorithm (WM) [24], using all possible prunings as experts. WM is guaranteed to not make many more prediction mistakes than the best expert, so we know that a brute-force WM will perform nearly as well as the best pruning. However, the exponential number of prunings motivates us to use an MCMC approach to approximate the weighted sum of the experts' predictions.
MCMC methods [16] have been applied to problems in approximate summation, where the goal is to approximate W = Σ_{x∈Ω} s(x), where s is a positive function and Ω is a finite set of combinatorial structures. It involves defining an ergodic Markov chain M with state space Ω and stationary distribution π. Then one repeatedly simulates M to draw samples almost according to π. Under appropriate conditions, this technique yields accuracy guarantees. E.g. sometimes one can guarantee that the estimate of the sum is within a factor ε of the true value (with high
2 For additive weight-update algorithms such as the Perceptron algorithm, kernels can often be used to exactly compute the weighted sums (e.g. [8,29,36]), though a kernel function might not exist for the desired mapping of features.
3 We believe that the untruncated requirement can be removed by generalizing results of Morris and Sinclair [28] (see Section 4.2).
4 We also study the application of our methods to learning DNF via Rosenblatt's Perceptron [33] algorithm, though this is done only for contrast with Winnow since exact sums for DNF learning via Perceptron can be computed with kernels [18].
probability). When this is true and the estimation algorithm requires only polynomial time, the algorithm is called a fully polynomial randomized approximation scheme (FPRAS).
We combine two FPRASs for application to estimating weighted sums. The first approximator is for the approximate knapsack problem [10,28], where given a positive real vector w and a real number b, the goal is to estimate |{p ∈ {0,1}^n : w·p ≤ b}| within a multiplicative factor ε. The other FPRAS is for estimating the sum of the weights of the weighted matchings of a graph: for a graph G and λ ≥ 0, approximate Z_G(λ) = Σ_{k=0}^{n} m_k λ^k, where m_k is the number of matchings in G of size k and n is the number of nodes. This problem has applications to the monomer-dimer
problem of statistical physics [16].
While we have thoroughly analyzed our approach in the context of these two problems, our results do not guarantee efficient algorithms for learning DNF or for finding the best pruning.5 But we do provide theoretical machinery that could potentially be applied to analyze algorithms that learn e.g. restricted cases of DNF, including subclasses of DNF formulas and/or specific distributions over examples. Further, our experimental results provide interesting insights into the algorithms' behaviors and show that the weighted sums can be approximated well despite the pessimistic worst-case bounds. Couple this with the fact that good approximations of the weighted sums are not always necessary to accurately simulate Winnow and WM (since we are only interested in the predictions made based on these weighted sums, not the sums themselves), and our results have potential to be effective tools in theory and in practice.
The rest of this paper is organized as follows. In Section 2 we give background on the on-line learning model and summarize related work in learning DNF formulas, pruning ensembles of classifiers, and MCMC methods. Section 3 presents our algorithm and Markov chain, and proves general bounds on the accuracy and time complexity of our estimation procedure. In Section 4 we apply these results to the problems of learning DNF formulas with Winnow and Perceptron and pruning ensembles with Weighted Majority. Then some empirical results appear in Section 5. Finally, we conclude in Section 6 with a description of future and ongoing work.
2. Related work
2.1. The on-line learning model
We focus on on-line learning algorithms, where learning proceeds in a series of trials.6 In trial t, an example X_t is presented to the learning algorithm A, which makes a prediction7 ĉ_t of X_t's label. After this prediction is made, A is told the true label c_t, which A uses to update its hypothesis before making future predictions. If c_t ≠ ĉ_t, we say that A made a prediction mistake. If M is the set of examples for which a mistake is made, the goal is to minimize |M| on any sequence of adversarially generated examples X = (X_1, …, X_t). Below we overview the on-line learning algorithms Winnow [21], Perceptron [33], and Weighted Majority (WM) [24].
5 This is not surprising, since it is unlikely that an efficient distribution-free DNF-learning algorithm exists [2,3].
6 In Sections 3 and 4, our results focus on only the current trial, so we omit the subscript t unless it is not clear from context.
7 We assume c_t, ĉ_t ∈ {−1, +1}.
Winnow maintains a weight vector w ∈ ℝ₊^N (N-dimensional positive real space). Upon receiving an instance X_t ∈ [0,1]^N, Winnow makes its prediction ĉ_t = +1 if W_t = w_t·X_t ≥ θ and −1 otherwise (θ > 0 is a threshold). Given the true label c_t, the weights are updated as follows: w_{t+1,i} = w_{t,i}·α^{X_{t,i}(c_t − ĉ_t)/2} for some α > 1. If w_{t+1,i} > w_{t,i} we call the update a promotion, and if w_{t+1,i} < w_{t,i} we call it a demotion. Littlestone [21] showed that if each example is labeled by some monotone disjunction of K of its N inputs, then Winnow will never make more than O(K log N) prediction mistakes on any sequence of examples. This makes Winnow a natural tool to apply to learning DNF since by enumerating all 3^n possible terms as inputs to Winnow, K-term DNF can be learned while making only O(Kn) prediction mistakes. However, the time complexity of running Winnow this way is exponential in n.
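As a concrete sketch, one Winnow trial (prediction followed by a multiplicative update) can be written as follows. The default threshold θ = N is an illustrative choice, not mandated by the paper, which only requires θ > 0:

```python
import numpy as np

def winnow_trial(w, X, c, alpha=2.0, theta=None):
    """One Winnow trial: predict, then multiplicatively update on a mistake.

    w: positive weights; X: instance in [0,1]^N; c: true label in {-1,+1}.
    theta defaults to N purely for illustration.
    """
    if theta is None:
        theta = len(w)
    c_hat = 1 if np.dot(w, X) >= theta else -1
    if c_hat != c:
        # w_i <- w_i * alpha^(X_i * (c - c_hat)/2): a promotion or demotion
        w = w * alpha ** (X * (c - c_hat) / 2.0)
    return w, c_hat
```

Note that when c = ĉ_t the exponent is 0 and the weights are unchanged, matching the update rule above.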
Similar to Winnow, the Perceptron algorithm maintains a weight vector w ∈ ℝ^N. Upon receiving an instance X_t ∈ [0,1]^N, it makes its prediction ĉ_t = +1 if W_t = w_t·X_t ≥ θ and −1 otherwise.8 Given the true label c_t, the weights are updated as follows: w_{t+1,i} = w_{t,i} + αX_{t,i}(c_t − ĉ_t)/2 for some α > 0. In contrast to Winnow, the Perceptron algorithm can be forced to make O(KN) mistakes on monotone K-disjunctions over N inputs [20], making it inappropriate for learning DNF (see also Khardon et al. [18]). However, the additive nature of the weight updates yields much better time complexity bounds for MCMC in contrast to those for multiplicative weight-update schemes (Section 4.1.2).
Inputs to the Weighted Majority algorithm [24] are themselves predictions of "experts" on the current example9 x_t. Each such expert e_i in the pool has its own weight w_i (initialized to 1), and when a new example x_t is given to each expert in the pool, expert e_i sends to WM its prediction X_{t,i} = e_i(x_t) ∈ ℝ, where the sign of X_{t,i} indicates e_i's predicted label and |X_{t,i}| can be thought of as e_i's confidence (though some experts may restrict themselves to predictions from {−1,+1}). WM then takes a weighted combination of the predictions and predicts ĉ_t = +1 if W_t = w_t·X_t ≥ 0 and −1 otherwise. Upon receiving the correct label c_t, if WM makes a prediction mistake, it reduces the weights of all experts that predicted incorrectly by dividing them by some constant α > 1. It has been shown that if the best expert in the pool makes at most ν mistakes, then WM has a mistake bound10 of O(ν + log N). Applying this to predicting nearly as well as the best pruning of an ensemble is straightforward. By placing each possible pruning into the pool, we get a pool size of N = 2^n and thus a mistake bound of O(ν + n). However, the time complexity of a straightforward implementation of this algorithm is exponential in n.
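A matching sketch of one WM trial, under the convention above that the sign of X_{t,i} is expert i's predicted label (treating a zero prediction as incorrect is our simplification, not the paper's):

```python
import numpy as np

def wm_trial(w, X, c, alpha=2.0):
    """One Weighted Majority trial with real-valued expert predictions.

    X[i] is expert i's prediction; its sign is the predicted label.
    On a mistake, every expert whose predicted label was wrong has its
    weight divided by alpha.
    """
    c_hat = 1 if np.dot(w, X) >= 0 else -1
    if c_hat != c:
        wrong = np.sign(X) != c  # experts whose predicted label disagrees with c
        w = np.where(wrong, w / alpha, w)
    return w, c_hat
```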
2.2. Learning DNF formulas
Learning DNF formulas has been heavily studied, but positive learning-theoretic results exist only in restricted cases, including assuming a uniform distribution over examples (e.g. [5]) or
8 For additive weight-update algorithms like Perceptron, often the threshold is included in the weight vector as w_{t,0}, corresponding to an extra attribute X_{t,0} = −1. The dot product is then compared to 0 rather than θ.
9 Throughout this paper, lower case x and y will represent examples in the original space, while capital X and Y represent the examples mapped to a new space, which is the input space of Winnow, Perceptron, and WM.
10 Stronger results on predicting with expert advice were given by Cesa-Bianchi et al. [6] using a more complex algorithm, but these are only better than WM's by a constant factor. Thus for simplicity, we use WM.
assuming that the number of terms is bounded by O(log n), where n is the number of variables [4]. In both of these cases, the algorithms require, in addition to labeled examples, membership queries, i.e. they need to be able to present arbitrary examples to an oracle and be told their labels.
In contrast, directly applying Winnow to this problem by enumerating all possible terms and learning a monotone disjunction over them does not require any restrictions or the use of membership queries. Since there are only 3^n possible DNF terms over n variables, Winnow's mistake bound on this problem is O(Kn), where K is the number of relevant terms in the target function. However, the time complexity to make a prediction on each example is exponential in this case if a brute-force approach is taken. Indeed, Khardon et al. [18] showed that if P ≠ #P, then there is no polynomial-time algorithm to exactly simulate Winnow over exponentially many conjunctive features for learning even monotone DNF. Further, while they did provide a kernel allowing them to exactly compute Perceptron's weighted sums when learning DNF, they also gave an exponential lower bound on the number of mistakes that kernel perceptron makes in learning DNF: 2^{Ω(n)}.
2.3. Pruning ensembles of classifiers
Prior work in pruning ensembles of classifiers [26,40] (produced by boosting) has been conducted for two reasons. First, the time required to evaluate a complete ensemble is prohibitive in some applications. Second, despite some evidence to the contrary [32,34], boosting can be prone to overfitting. The methods of Margineantu and Dietterich [26] and Tamon and Xiang [40] not only sought subsets of the ensemble with high prediction accuracy, but also with high diversity, i.e. hypotheses with high accuracy on different portions of the instance space. The approaches they used included simple ones like early stopping, ones that utilized divergence measures such as Kullback–Leibler divergence or the κ statistic, and methods that used prediction error, sometimes combined with a divergence measure.
To address the concern of overfitting, one can use the WM algorithm, using all possible prunings as "experts" in a pool. Since WM is guaranteed to not perform much worse (in terms of number of on-line prediction mistakes) than the best expert in the pool, we know that a brute-force implementation of this algorithm is guaranteed to not perform much worse than the best pruning. However, a brute-force implementation of WM would take time exponential in the number of hypotheses in the ensemble. So we use an MCMC approach to approximate the weighted sum of the experts' predictions.
2.4. Markov chain Monte Carlo methods
MCMC methods [16] have been applied to problems in combinatorial optimization and approximate summation, where the goal is to approximate a weighted sum W = Σ_{x∈Ω} s(x), where s is a positive function defined on Ω and Ω is a very large, finite set of combinatorial structures. The process involves defining an ergodic Markov chain M with state space Ω and stationary distribution π. Then one repeatedly simulates M some number of steps to draw several samples almost according to π. Under appropriate conditions, this technique yields accuracy guarantees. E.g. in approximate summation, sometimes one can guarantee that the estimate of the sum is within a factor ε of the true value with high probability. When this is true and the estimation
algorithm requires only polynomial time, the algorithm is called a fully polynomial randomized approximation scheme (FPRAS). In certain cases a similar argument can be made about combinatorial optimization problems, i.e. that the algorithm's solution is within a factor of ε of the true maximum or minimum.
A well-studied problem with an MCMC solution is the approximate knapsack problem, where one is given a positive real vector w and a real number b. The goal is to estimate |Ω| within a multiplicative factor of ε, where Ω = {p ∈ {0,1}^n : w·p ≤ b} (i.e. as an approximate summation problem, s(p) = 1 for all p ∈ Ω). Dyer et al. [10] gave a Markov chain for this problem and argued that a polynomial (in n and 1/ε) number of samples from it were sufficient to accurately estimate |Ω|. Later, Morris and Sinclair [28] showed that it is sufficient to simulate the chain for a polynomial number of steps to obtain each sample (i.e. that the chain is rapidly mixing), thus giving an FPRAS for the knapsack problem.
Another problem with an FPRAS [16] is computing the sum of the weights of weighted matchings with parameter λ: for a graph G and λ ≥ 0, approximate Z_G(λ) = Σ_{k=0}^{n} m_k λ^k, where m_k is the number of matchings in G of size k and n is the number of nodes. This problem has applications to the monomer-dimer problem of statistical physics. In the next section, we combine the knapsack solution with the matching solution to approximate the weighted sums of inputs of linear learning algorithms.
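For tiny instances, both of these quantities can be computed exactly by brute force, which is useful for sanity-checking any approximate counter; the function names below are ours:

```python
from itertools import combinations, product

def knapsack_count(w, b):
    """Exact |{p in {0,1}^n : w . p <= b}| by enumeration (exponential in n;
    only for checking an approximate counter on tiny instances)."""
    return sum(1 for p in product((0, 1), repeat=len(w))
               if sum(wi * pi for wi, pi in zip(w, p)) <= b)

def matching_poly(edges, lam):
    """Evaluate Z_G(lambda) = sum_k m_k lambda^k by enumerating all subsets
    of edges and keeping the matchings (pairwise node-disjoint edges)."""
    total = 0.0
    for k in range(len(edges) + 1):
        for sub in combinations(edges, k):
            nodes = [v for e in sub for v in e]
            if len(nodes) == len(set(nodes)):  # no shared endpoint
                total += lam ** k
    return total
```

For example, the path graph on 3 nodes has m_0 = 1 and m_1 = 2, so Z_G(λ) = 1 + 2λ.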
3. Our general algorithm and Markov chain
In general, our state space Ω will consist of the set of inputs to the learning algorithm under consideration (Perceptron, Winnow, or WM). As such, we can think of the states of Ω as functions that map from examples in the original space (the x variables) to the input space of the learning algorithm (the X variables). So for a state p ∈ Ω and an input example x, we let p(x) denote p evaluated at x. E.g. when learning DNF with binary p(x) (Section 4.1.1), p is a term, x is an assignment to the variables, and p(x) = 1 if x satisfies p and 0 otherwise.
Depending on the application and choice of p(x) in Section 4, Ω will take on different forms. When learning DNF with Perceptron or Winnow and binary p(x), we use Ω = {0,1}^n. When learning DNF with Perceptron or Winnow and p(x) a linear or logistic function (see Section 4.1.1), we use Ω = Π_{i=1}^{n} {0,…,k_i} for some integers k_1,…,k_n > 0 (as described in Section 4.1, k_i is the number of values for feature i in a general DNF representation). When we use WM to prune ensembles, we define Ω⁺ (similarly, Ω⁻) as the set of prunings that predict +1 (similarly, −1) on the current example. In Section 4.2 we show that Ω⁺ and Ω⁻ are each simply {0,1}^n truncated by a single hyperplane. Since in this case the state space of our Markov chain is truncated, we must take care to not exit the state space during a transition. Hence the need for Step 3 in our algorithm below.
Consider a vector p = (p_1,…,p_i,…,p_n). We say that vector p′ is a neighbor of p if and only if p and p′ differ in at most one position, i.e. if and only if p′ = (p_1,…,p′_i,…,p_n), where p′_i may or may not equal p_i (if p′_i = p_i then the edge from p to p′ is a self-loop). (Note that if Ω is a truncated hypercube, then p′ might not be in Ω, even if p ∈ Ω. This is why we test for membership in Step 3
below.) We now define M as a Markov chain with state space Ω that makes transitions from state p ∈ Ω to state q ∈ Ω by the following rules.
(1) With probability 1/2 let q = p. Otherwise:
(2) Let p′ be a neighbor of p selected uniformly at random.
(3) If p′ ∈ Ω, then let p″ = p′; else let p″ = p.
(4) With probability min{1, p″(x)·w_{p″} / (p(x)·w_p)}, let q = p″; else let q = p. Here w_p is the weight of node p in the learning algorithm.
Thus M is a random walk where the transition probabilities favor nodes with higher weights.
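A minimal sketch of one transition of M, with the weight function, neighbor list, and membership test passed in as callables (an interface we assume for illustration):

```python
import random

def chain_step(p, weight, neighbors, in_omega=lambda q: True):
    """One transition of the chain M, following Steps 1-4 above.

    p: current state; weight(q): unnormalized stationary weight q(x)*w_q;
    neighbors(p): list of p's neighbors; in_omega: membership test used
    when the state space is truncated (Step 3).
    """
    if random.random() < 0.5:        # Step 1: lazy self-loop
        return p
    q = random.choice(neighbors(p))  # Step 2: uniform random neighbor
    if not in_omega(q):              # Step 3: refuse to leave Omega
        q = p
    # Step 4: Metropolis-style acceptance favoring heavier states
    if random.random() < min(1.0, weight(q) / weight(p)):
        return q
    return p
```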
Lemma 1. If every state in Ω can be reached from every other state, then M is ergodic with stationary distribution

π_t(p) = p(x_t)·w_p / W_t,

where W_t = Σ_{p∈Ω} p(x_t)·w_p, i.e. the weighted sum of inputs over all states (inputs) in Ω when example x_t is the current example.
Proof. Since all states in Ω can communicate, M is irreducible. Also, the self-loop of Step 1 ensures aperiodicity. Finally, M is reversible since the transition probabilities

P(p,q) = min{1, q(x_t)·w_q / (p(x_t)·w_p)} / (2n) = min{1, π_t(q)/π_t(p)} / (2n)

(here n is the number of neighbors) satisfy the detailed balance condition π_t(p)·P(p,q) = π_t(q)·P(q,p). So M is ergodic with the stated stationary distribution. □
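The detailed balance condition can be checked numerically on a toy hypercube; the sketch below hard-codes the proof's transition probability min{1, π(q)/π(p)}/(2n) for neighboring states:

```python
from itertools import product

def check_detailed_balance(weights, n):
    """Numerically verify Lemma 1's detailed-balance condition
    pi(p) P(p,q) = pi(q) P(q,p) on the full hypercube {0,1}^n,
    where P(p,q) = min{1, pi(q)/pi(p)} / (2n) for neighbors p != q."""
    states = list(product((0, 1), repeat=n))
    total = sum(weights[s] for s in states)
    pi = {s: weights[s] / total for s in states}
    P = lambda p, q: min(1.0, pi[q] / pi[p]) / (2 * n)
    for p in states:
        for q in states:
            if sum(a != b for a, b in zip(p, q)) == 1:  # p, q neighbors
                assert abs(pi[p] * P(p, q) - pi[q] * P(q, p)) < 1e-12
    return True
```

The check passes for any positive weights because π(p)·min{1, π(q)/π(p)} = min{π(p), π(q)}, which is symmetric in p and q.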
For each new trial in an on-line algorithm, the weighted sums we estimate are potentially different. Thus we must conduct a new estimation procedure (with a new Markov chain) for each trial. To simplify notation, for the rest of this paper we will let the index t of each trial be implicit, omitting any subscripts unless necessary. Further, in each trial our algorithm defines multiple Markov chains, each assuming that the weight updates of previous trials were made using a different11 learning rate α_i. Hence the weight w_p of a node p (and hence the stationary distribution π and the sum of weights W) will be functions of both α_i and t, so we use the subscript of α_i to denote these differences, leaving the t implicit.
Recalling the definition of Winnow in Section 2.1, if the initial weight vector is the all-1s vector, the weight of term p is w_p = α^{z_p}, where z_p = Σ_{x∈M} c_x·p(x), M is the set of examples for which a
11 Note, however, that the actual sequence of updates made will be the same regardless of α_i. This sequence of updates is determined by running the learning algorithm with the original learning rate α.
prediction mistake is made, and c_x ∈ {−1,+1} is example x's label. We will refer to z_p as node p's total update, since from it we can directly compute node p's weight. In fact, if the Perceptron algorithm is started with all weights equal to 0, then node p's weight is w_p = α·z_p. Similarly, starting WM with all weights equal to 1 implies w_p = α^{z_p}, just like Winnow.
Let B be a bound (over all nodes in Ω) on the magnitudes of the total updates up to the current trial, i.e. B ≥ max_{p∈Ω} |z_p|. Since this requires taking the maximum over an exponentially large set, we note that it suffices to instead use B ≥ Σ_{x∈M} max_{p∈Ω} |c_x·p(x)|. This quantity is easy to bound so long as bounds on the possible values of c_x and p(x) are known for each x, which is the case for all our algorithms. (E.g. in Winnow and Perceptron, it suffices to set B equal to the sum of all promotions and demotions made on all examples for which a prediction mistake was made up to the current trial.) Now let r be the smallest integer such that (1 + 1/B)^{r−1} ≥ α and r ≥ 1 + log₂ α (so r ≤ 2 + B ln α). Also, let z = 1/(α^{1/(r−1)} − 1) ≥ B and α_i = (1 + 1/z)^{i−1} = α^{(i−1)/(r−1)} for 1 ≤ i ≤ r (so α_r = α). Now define f_i(p) = w_{α_{i−1},p}/w_{α_i,p}, where p is chosen according to π_{α_i}.
Then

E[f_i] = Σ_{p∈Ω} (w_{α_{i−1},p} / w_{α_i,p}) · (p(x)·w_{α_i,p} / W(α_i)) = W(α_{i−1}) / W(α_i).
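The ladder of learning rates α_1, …, α_r defined above can be computed directly; a sketch:

```python
import math

def rate_ladder(alpha, B):
    """The intermediate learning rates alpha_1, ..., alpha_r: r is the
    smallest integer with (1 + 1/B)^(r-1) >= alpha and r >= 1 + log2(alpha),
    and alpha_i = alpha^((i-1)/(r-1)), so alpha_1 = 1 and alpha_r = alpha."""
    r = 2
    while (1 + 1.0 / B) ** (r - 1) < alpha or r < 1 + math.log2(alpha):
        r += 1
    return [alpha ** ((i - 1) / (r - 1)) for i in range(1, r + 1)]
```

For example, α = 2 and B = 10 give r = 9 rates climbing geometrically from 1 to 2.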
So we can estimate W(α_{i−1})/W(α_i) by sampling states p from M and computing the sample mean of the f_i(p). Note that

W(α) = (W(α_r)/W(α_{r−1})) · (W(α_{r−1})/W(α_{r−2})) ⋯ (W(α_2)/W(α_1)) · W(α_1).

So for each value α_2, …, α_r, we run S independent simulations of M, and let X̄_i be the sample mean of w_{α_{i−1}}/w_{α_i}. Then our estimate12 is

Ŵ(α) = W(α_1) · Π_{i=2}^{r} (1/X̄_i).  (1)
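Eq. (1) can be sketched as follows, with the Markov-chain sampling abstracted behind a caller-supplied ratio sampler (an assumed interface, not the paper's notation):

```python
def estimate_W(W1, ladder, ratio_sampler, S=1000):
    """Telescoping-product estimator of W(alpha), Eq. (1):
    start from the exactly computed W(alpha_1) and divide by the sample
    mean of w_{alpha_{i-1},p} / w_{alpha_i,p} for each rung of the ladder.

    ratio_sampler(a_i, a_prev) stands in for drawing a state p roughly
    from pi_{alpha_i} by simulating M and returning the weight ratio.
    """
    est = W1
    for i in range(1, len(ladder)):
        mean = sum(ratio_sampler(ladder[i], ladder[i - 1])
                   for _ in range(S)) / S
        est /= mean  # multiply by 1 / X-bar_i
    return est
```

As a sanity check, if the sampler returns the exact ratio W(α_{i−1})/W(α_i) every time, the estimator recovers W(α) exactly.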
In order to complete our computation of W, we must also compute W(α_1). Due to the definition of α_1, this is straightforward for learning DNF with the various definitions of p(x) (Sections 4.1.1 and 4.1.2). For the ensemble pruning problem, W⁺(α_1) = |Ω⁺|, where Ω⁺ is the set of prunings that predict +1 on the input x. This cannot be efficiently computed exactly, so we must estimate it with the FPRAS of Morris and Sinclair [28] (Section 4.2).
The following theorem bounds the error of our algorithm's estimates of W. The theorem is based on variation distance, which is a distance measure between a Markov chain's simulated and stationary distributions, defined as max_{U⊆Ω} |P^t(p,U) − π(U)|, where P^t(p,·) is the distribution of a chain's state at simulation step t given that the simulation started in state p ∈ Ω, and π is the chain's stationary distribution.
Theorem 2. Assume a ≤ f_i, f̂_i ≤ b for all i, where f̂_i is the same as f_i but with samples drawn according to the distribution yielded by simulating M. Let the sample size S = ⌈130rb/(aε²)⌉ and let M be
12 When we apply our results to the Perceptron algorithm in Section 4.1.2, we will also use α_0 = 0 and update the product of ratios accordingly.
simulated long enough for each sample such that the variation distance between the empirical distribution and π_{α_i} is at most εa/(5br) for each i. Also, assume that W(α_1) can be computed exactly. Then for any δ > 0, Ŵ(α) satisfies

Pr[(1 − ε)·W(α) ≤ Ŵ(α) ≤ (1 + ε)·W(α)] ≥ 1 − δ.
Proof. Let the distribution π̂_{α_i} be the one resulting from simulating M, and assume that the variation distance ‖π̂_{α_i} − π_{α_i}‖ ≤ εa/(5br). Now consider the random variable f̂_i, which is the same as f_i except that the terms are selected according to π̂_{α_i}. Since f̂_i ∈ [a,b], |E[f̂_i] − E[f_i]| ≤ εa/(5r), which implies E[f_i] − εa/(5r) ≤ E[f̂_i] ≤ E[f_i] + εa/(5r). Factoring out E[f_i] from both sides and noting that 1/E[f_i] ≤ 1/a yields

(1 − ε/(5r))·E[f_i] ≤ E[f̂_i] ≤ (1 + ε/(5r))·E[f_i].  (2)

This allows us to conclude that E[f̂_i] ≥ E[f_i]/2. Since f̂_i ≤ b, we get Var[f̂_i] ≤ b·E[f̂_i], yielding

Var[f̂_i]/(E[f̂_i])² ≤ b/E[f̂_i] ≤ 2b/E[f_i] ≤ 2b/a.  (3)
Let X_i^{(1)}, …, X_i^{(S)} be a sequence of S independent copies of f̂_i, and let X̄_i = (Σ_{j=1}^{S} X_i^{(j)})/S. Then E[X̄_i] = E[f̂_i] and Var[X̄_i] = Var[f̂_i]/S. The estimator of W(α) is W(α_1)/X = W(α_1)/Π_{i=2}^{r} X̄_i. Since the X̄_i's are independent, E[X] = Π_{i=2}^{r} E[X̄_i] = Π_{i=2}^{r} E[f̂_i] and E[X²] = Π_{i=2}^{r} E[X̄_i²]. Let ρ = Π_{i=2}^{r} W(α_{i−1})/W(α_i) (i.e. what we are estimating with X) and ρ̂ = E[X]. Then applying Eq. (2) gives

(1 − ε/(5r))^r · ρ ≤ ρ̂ ≤ (1 + ε/(5r))^r · ρ.

Since lim_{r→∞} (1 + ε/(5r))^r = e^{ε/5} ≤ 1 + ε/4 and (1 − ε/(5r))^r is minimized at r = 1, we get

(1 − ε/4)·ρ ≤ ρ̂ ≤ (1 + ε/4)·ρ.
Since Var[X] = E[X²] − (E[X])², we have

Var[X]/(E[X])² = Π_{i=2}^{r} (1 + Var[X̄_i]/(E[X̄_i])²) − 1
≤ (1 + 2b/(aS))^{r−1} − 1  (by Eq. (3))
≤ (1 + ε²/(65r))^{r} − 1 ≤ exp(ε²/65) − 1 ≤ ε²/64.

The last inequality holds since exp(x/65) ≤ 1 + x/64 for x ∈ [0,1]. We now apply Chebyshev's inequality to X, whose standard deviation is at most ερ̂/8:

Pr[|X − ρ̂| > ερ̂/4] ≤ 1/4.
So with probability at least 3/4 we get

(1 − ε/4)·ρ̂ ≤ X ≤ (1 + ε/4)·ρ̂,

which implies that with probability at least 3/4

(1 + ε)/ρ ≥ 1/((1 − ε/4)²·ρ) ≥ 1/X ≥ 1/((1 + ε/4)²·ρ) ≥ (1 − ε)/ρ.  (4)

Making the approximation hold with probability at least 1 − δ for any δ > 0 is done by rerunning the procedure for estimating X O(ln(1/δ)) times and taking the median of the results [17]. □
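The median amplification step at the end of the proof can be sketched as follows; the repetition constant is illustrative only, since the argument requires just O(ln(1/δ)) reruns:

```python
import math
import statistics

def median_amplify(run_once, delta, reps_constant=2):
    """Boost an estimator that is accurate with probability >= 3/4 to
    failure probability delta by repeating O(ln(1/delta)) times and
    returning the median of the results [17]."""
    k = max(1, math.ceil(reps_constant * math.log(1.0 / delta)))
    return statistics.median(run_once() for _ in range(k))
```

The median is robust here because a majority of the k runs are accurate with overwhelming probability, by a Chernoff bound.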
It is also possible to extend Theorem 2 to the case where W(α_1) cannot be exactly computed, but can be accurately estimated.
Corollary 3. Assume a ≤ f_i, f̂_i ≤ b for all i. Let the sample size S = ⌈130rb/(aε²)⌉, let W(α_1)'s estimate be within a factor ε/2 of its true value with probability ≥ 3/4, and let M be simulated long enough for each sample such that the variation distance between the empirical distribution and π_{α_i} is at most εa/(10br) for all i. Then for any δ > 0, Ŵ(α) satisfies

Pr[(1 − ε)·W(α) ≤ Ŵ(α) ≤ (1 + ε)·W(α)] ≥ 1 − δ.
Proof. The analysis is the same as in the proof of Theorem 2, except we now must accommodate another source of error. First, substitute ε/2 for ε in Eq. (4). Given the accuracy of W(α_1)'s estimate Ŵ(α_1) with probability at least 3/4, we get

W(α_1)(1 − ε/2)²/ρ ≤ Ŵ(α_1)/X ≤ W(α_1)(1 + ε/2)²/ρ

with probability at least 1/2. This completes the proof (the constants in S remain unchanged). Similar to Theorem 2, both estimates can be run multiple times and the median taken in order to reduce the probability of failure. □
We now bound the mixing time of M using the canonical paths method [38]. In this method, we treat M as a directed graph with vertices Ω and edges E = {(p,q) ∈ Ω×Ω : Q(p,q) > 0}, where Q(p,q) = π_α(p)·P(p,q). For each ordered pair (p,q) ∈ Ω×Ω, we specify a canonical path γ_{p,q} ∈ Γ from p to q in the graph (Ω,E) that corresponds to a sequence of legal transitions in M from p to q. We measure how heavily any one edge in E is loaded with canonical paths by

ρ̄ = ρ̄(Γ) = max_{e∈E} { (1/Q(e)) · Σ_{γ_{p,q} ∋ e} π_α(p)·π_α(q)·|γ_{p,q}| }.  (5)
We start with a result from Sinclair [38], restated by Jerrum and Sinclair [16].
Theorem 4 (Jerrum and Sinclair [16], Sinclair [38]). Let M be a finite, reversible, ergodic Markov chain with loop probabilities P(p,p) ≥ 1/2 for all p. Let Γ be a set of canonical paths with maximum edge loading ρ̄ = ρ̄(Γ). Then the mixing time of M satisfies τ_p(ε) ≤ ρ̄·(ln(1/π(p)) + ln(1/ε)) for any choice of initial state p; i.e. after simulating M for ρ̄·(ln(1/π(p)) + ln(1/ε)) steps starting in p, the variation distance between π̂_{α_i} and π_{α_i} is at most ε.
In general, Ω = Π_{i=1}^{n} {0,…,k_i} for some integers k_1,…,k_n. Without loss of generality we let Ω = {0,…,k}^n for some positive integer k. Then there is an edge from node p = (p_1,…,p_i,…,p_n) to p′ = (p_1,…,p′_i,…,p_n), i.e. an edge exists between each pair of nodes that differ in at most one position (self-loops also exist). For our proof, we assume that the hypercube is untruncated, which is necessary to ensure that no canonical paths leave the chain. However, it is likely that mixing time bounds also exist for truncated hypercubes. Such a bound could probably be derived from the recent work of Morris and Sinclair [28], who give an FPRAS for a truncated Boolean hypercube that has a uniform distribution.
Let p = (p_1,…,p_n) and q = (q_1,…,q_n) be arbitrary states of Ω. The canonical path γ_{p,q} consists of n edges, where edge i is

((q_1,…,q_{i−1}, p_i, p_{i+1},…,p_n), (q_1,…,q_{i−1}, q_i, p_{i+1},…,p_n)),

i.e. position i is changed from p_i to q_i. So some edges of γ_{p,q} might be loops. Now focus on a particular oriented edge

e = (a, a′) = ((a_1,…,a_i,…,a_n), (a_1,…,a′_i,…,a_n)).

We will now bound Eq. (5) for e, which yields a bound on ρ̄ and allows us to apply Theorem 4. Let cp(e) = {(p,q) : γ_{p,q} ∋ e} be the set of endpoints of canonical paths that use edge e. We use Jerrum and Sinclair's [16] mapping η_e : cp(e) → Ω, defined13 as follows: if (p,q) = ((p_1,…,p_n), (q_1,…,q_n)) ∈ cp(e), then

η_e(p,q) = (b_1,…,b_n) = (p_1,…,p_{i−1}, a_i, q_{i+1},…,q_n).

Note that p = (b_1,…,b_{i−1}, a_i, a_{i+1},…,a_n) and q = (a_1,…,a_{i−1}, a′_i, b_{i+1},…,b_n). Since p and q can be unambiguously recovered from η_e(p,q), the mapping η_e is injective.
We are now ready to state the mixing time bound.
Theorem 5. For all $\vec p,\vec q \in \Omega$ and for all $e \in \Omega\times\Omega$ such that $(\vec p,\vec q) \in cp(e)$, assume
$\pi_\alpha(\vec p)\,\pi_\alpha(\vec q) \le g\,\pi_\alpha(\vec a\,')\,\pi_\alpha(\vec\eta_e(\vec p,\vec q))$
for some function $g = g(n,K,k,\alpha)$. Also assume that for all neighbors $\vec a$ and $\vec a\,'$ in $\Omega$,
$\max\{\pi_\alpha(\vec a)/\pi_\alpha(\vec a\,'),\ \pi_\alpha(\vec a\,')/\pi_\alpha(\vec a)\} \le h = h(n,K,k,\alpha)$.
Then a simulation of $M$ that starts at node $\vec p$ and is of length
$T = 2kn^2\,g\,h\,\big(\ln\big(W(\alpha)/w_{\alpha,\vec p}\big) + \ln(1/\epsilon')\big)$
will draw samples from $\hat\pi_\alpha$ such that $\|\hat\pi_\alpha - \pi_\alpha\| \le \epsilon'$.
[Footnote 13: Vector notation is used when denoting $\vec\eta_e$ since $\vec\eta_e(\vec p,\vec q)\in\Omega$ for all $\vec p,\vec q\in\Omega$.]
Proof. Since $Q(e) = \min\{\pi_\alpha(\vec a),\pi_\alpha(\vec a\,')\}/(2kn)$, we get
$\pi_\alpha(\vec p)\,\pi_\alpha(\vec q) \le \frac{2kn\,g\,Q(e)}{\min\{\pi_\alpha(\vec a),\pi_\alpha(\vec a\,')\}}\,\pi_\alpha(\vec a\,')\,\pi_\alpha(\vec\eta_e(\vec p,\vec q)) = 2kn\,g\,Q(e)\,\max\{1,\pi_\alpha(\vec a\,')/\pi_\alpha(\vec a)\}\,\pi_\alpha(\vec\eta_e(\vec p,\vec q)) \le 2kn\,g\,h\,Q(e)\,\pi_\alpha(\vec\eta_e(\vec p,\vec q)).$
Given the above inequality, we can now bound $\bar\rho$. Since $|\gamma_{\vec p,\vec q}| = n$, we get
$\frac{1}{Q(e)}\sum_{\gamma_{\vec p,\vec q}\ni e} \pi_\alpha(\vec p)\,\pi_\alpha(\vec q)\,|\gamma_{\vec p,\vec q}| \le 2kn^2\,g\,h \sum_{\gamma_{\vec p,\vec q}\ni e} \pi_\alpha(\vec\eta_e(\vec p,\vec q)) \le 2kn^2\,g\,h.$
The last inequality holds because $\vec\eta_e$ is injective and $\pi_\alpha$ is a probability distribution. Applying Theorem 4 completes the proof. $\square$
Corollary 6. For Markov chains for which $g$ and $h$ are polynomial in $n$ and $K$ (we assume $k$ and $\alpha$ are constants), and for approximation schemes for which $b$ and $1/a$ are polynomial in $n$ and $K$, our algorithm is an FPRAS.
4. Example applications

4.1. Learning DNF formulas

We consider generalized DNF representations, where the instance space is $\prod_{i=0}^{n-1}\{1,\ldots,k_i\}$ and the set of terms is $\prod_{i=0}^{n-1}\{0,\ldots,k_i\}$, where $k_i$ is the number of values for feature $i$. A term $\vec p = (p_0,\ldots,p_{n-1})$ is satisfied by example $\vec x = (x_0,\ldots,x_{n-1})$ if and only if $p_i = x_i$ for all $p_i > 0$. So $p_i = 0$ implies that $x_i$ is irrelevant for term $\vec p$, and $p_i > 0$ implies that $x_i$ must equal $p_i$ for $\vec p$ to be satisfied.

We present algorithms to learn this concept class that are based on Littlestone's Winnow [21] and Rosenblatt's Perceptron [33] algorithms. The inputs to the linear threshold units learned by these algorithms consist of the entire set of DNF terms over the original set of $n$ inputs. For reasons that will become clear, we look at different versions of the function $p(\vec x)$, which measures the degree to which $\vec x$ satisfies $\vec p$. These versions include $p(\vec x)$ being a threshold function, a logistic function, and a linear function.

None of our approaches in this section give complete, efficient solutions to the problem of learning DNF in the on-line model, but they do give new mechanisms that could be refined for potential application to restricted cases of the problem, e.g. restricted classes of DNF or specific distributions over the examples.
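The satisfaction rule for generalized DNF terms can be stated compactly in code. This is our own illustrative sketch of the semantics just described, not code from the paper:

```python
def satisfies(term, x):
    """True iff example x satisfies the term: every p_i > 0 must equal x_i."""
    return all(p == 0 or p == x_i for p, x_i in zip(term, x))

def dnf(terms, x):
    """A DNF formula is satisfied when any one of its terms is."""
    return any(satisfies(t, x) for t in terms)

# Example: n = 3 features; the term (2, 0, 1) requires x_0 == 2 and x_2 == 1,
# while x_1 is irrelevant (p_1 == 0).
assert satisfies((2, 0, 1), (2, 3, 1))
assert not satisfies((2, 0, 1), (2, 3, 2))
assert dnf([(2, 0, 1), (0, 1, 0)], (5, 1, 4))   # second term fires
```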
4.1.1. Winnow

Recalling the definition of Winnow in Section 2.1, if the initial weight vector is the all-1s vector, the weight of term $\vec p$ is $w_{\vec p} = \alpha^{z_{\vec p}}$, where $z_{\vec p} = \sum_{\vec x\in M} c_{\vec x}\,p(\vec x)$, $M$ is the set of examples for which a prediction mistake is made, and $c_{\vec x}\in\{-1,+1\}$ is example $\vec x$'s label. We now state one of Littlestone's results for Winnow [23], which we will use to bound the number of prediction mistakes it makes for the variations of our algorithm.
Theorem 7 (Littlestone [23]). Let $(\vec Y_j, c_j) \in [0,1]^N \times \{0,1\}$ for $j = 1,\ldots,t$ ($t$ is the index of the current trial). Suppose that there exist $\vec\mu \ge 0$ and $0 < \rho < 1$ such that whenever $c_j = 1$ we have $\vec\mu\cdot\vec Y_j \ge 1$, and whenever $c_j = 0$ we have $\vec\mu\cdot\vec Y_j \le 1 - \rho$. Now suppose Winnow sees as inputs $X = (\vec X_j, c_j)$ where each $X_{ij}\in[0,1]$, and define $\vec E_j = (|X_{j1}-Y_{j1}|,\ldots,|X_{jN}-Y_{jN}|)$. Then the number of mistakes made by Winnow on $X$ with $\alpha = 1+\rho/2$ and $\theta = N$ is at most
$\frac{8}{\rho^2} + \max\!\left(0,\ \frac{14}{\rho^2}\sum_{i=1}^N \mu_i\ln(\mu_i\theta)\right) + \frac{4}{\rho}\sum_{j=1}^t \vec\mu\cdot\vec E_j.$
We will examine three versions of our algorithm, differing in the values that Winnow receives as its inputs. In the following, we say that a variable $x_i$ in example $\vec x$ matches its corresponding variable $p_i$ in term $\vec p$ if $p_i = 0$ or $p_i = x_i$. We let $m_{\vec x,\vec p}\in\{0,\ldots,n\}$ denote the number of variables in $\vec x$ that match their corresponding variables in $\vec p$ (we drop the subscript $\vec x$ when it is clear from context).

(1) Binary $p(\vec x)$ means that Winnow input $X_{\vec p} = 1$ if $m_{\vec p} = n$ and 0 otherwise.

(2) Logistic $p(\vec x)$ means that Winnow input $X_{\vec p}$ is
$p(\vec x) = \frac{2}{1+e^{-\sigma(m_{\vec p}-n)}}, \qquad (6)$
where $\sigma > 0$ is a parameter. Thus $p(\vec x)\in(0,1]$ and grows as $\vec p$ becomes more satisfied by $\vec x$ (it equals 1 if and only if $\vec p$ is completely satisfied).

(3) Linear $p(\vec x)$ means that Winnow input $X_{\vec p}$ is
$p(\vec x) = \frac{1+m_{\vec p}}{n+1}. \qquad (7)$
Thus $p(\vec x)\in(0,1]$ and grows as $\vec p$ becomes more satisfied by $\vec x$ (it equals 1 if and only if $\vec p$ is completely satisfied).
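All three input functions depend on the example only through the match count $m_{\vec x,\vec p}$. A minimal sketch of the three versions (our own illustration; `sigma` stands for the logistic parameter $\sigma$):

```python
import math

def matches(term, x):
    """m: number of positions where p_i == 0 or p_i == x_i."""
    return sum(1 for p, x_i in zip(term, x) if p == 0 or p == x_i)

def binary_input(term, x):
    return 1.0 if matches(term, x) == len(x) else 0.0

def logistic_input(term, x, sigma):
    m, n = matches(term, x), len(x)
    return 2.0 / (1.0 + math.exp(-sigma * (m - n)))      # Eq. (6)

def linear_input(term, x):
    m, n = matches(term, x), len(x)
    return (1.0 + m) / (n + 1.0)                         # Eq. (7)

x = (1, 2, 3)
full = (1, 0, 3)          # satisfied: m = 3 = n
part = (1, 2, 1)          # m = 2
assert binary_input(full, x) == 1.0 and binary_input(part, x) == 0.0
assert abs(logistic_input(full, x, sigma=2.0) - 1.0) < 1e-12  # 1 iff satisfied
assert 0.0 < logistic_input(part, x, sigma=2.0) < 1.0
assert linear_input(full, x) == 1.0 and linear_input(part, x) == 0.75
```

The assertions reflect the properties stated above: both smooth versions lie in $(0,1]$ and equal 1 exactly when the term is completely satisfied.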
We defined $p(\vec x) > 0$ for all $\vec x$ in order to ensure that for a given $\vec x$, every term in $\Omega$ contributes something to the weighted sum and the hypercube is untruncated, which allows us to apply Theorem 5. Using binary $p(\vec x)$ also yields an untruncated hypercube, as explained below.

Before considering these three cases individually, we state some common results for them all. First note that for logistic and linear $p(\vec x)$, $\Omega$ consists of the entire set of possible terms, since each term gives to Winnow a value $p(\vec x) > 0$. Thus in the chain defined in Section 3, every state can be reached from every other state. Further, for binary $p(\vec x)$, we note that there are exactly $2^n$ terms that are satisfied by $\vec x$, i.e. $\vec p$ is satisfied by $\vec x$ if and only if $p_i = 0$ or $p_i = x_i$ for all $i\in\{1,\ldots,n\}$. Thus in this case we can construct the state space to be $\Omega = \{0,1\}^n$, which is completely connected and untruncated. Therefore Lemma 1 applies to all our Markov chains. We now discuss the application of Theorem 2. All that is required to apply this result is to bound the range of $f_i$ and $\phi_i$. Since $f_i$ and $\phi_i$ are independent of $p(\vec x)$, the same result applies to all three versions of our algorithm.
Lemma 8. When applying binary, logistic, or linear Winnow to learn DNF, for all $i$, $1/e \le f_i, \phi_i \le e$.

Proof. First note that the only difference between $f_i$ and $\phi_i$ is the probability distribution that generates the terms that define them, i.e. their ranges are the same. Thus we focus on bounding $f_i$ only. Let $z_{\vec p}$ be node $\vec p$'s total update as defined in Section 3. Then
$f_i(\vec p) = \frac{w_{\alpha_{i-1},\vec p}}{w_{\alpha_i,\vec p}} = \left(\frac{\alpha_{i-1}}{\alpha_i}\right)^{z_{\vec p}} = \left(\frac{(1+1/z)^{i-2}}{(1+1/z)^{i-1}}\right)^{z_{\vec p}} = (1+1/z)^{-z_{\vec p}}.$
Recall from its definition that $z \ge B \ge |z_{\vec p}|$ for all $\vec p$ (to avoid division by zero, we can also assume that $z > 0$). If $z_{\vec p} < 0$, then $1 \le (1+1/z)^{-z_{\vec p}} \le e$. If $z_{\vec p} \ge 0$, then $1/e \le (1+1/z)^{-z_{\vec p}} \le 1$. $\square$
Note that $W(\alpha_1) = W(1)$ is simply $\sum_{\vec p\in\Omega} p(\vec x)$. For the binary case, this is simply the number of terms satisfied by $\vec x$, which equals $2^n$. For the linear and logistic cases, it can be efficiently computed exactly if we assume that $k_i = k$ for all $i$. Under this assumption, the number of terms that match exactly $i\in\{0,\ldots,n\}$ variables in the example $\vec x$ is $2^i\binom{n}{i}(k-1)^{n-i}$, since there are $\binom{n}{i}$ positions to place the matched variables, each matched position $p_j$ can equal 0 or $x_j$, and each unmatched position $p_{j'}$ can take on any value from $\{1,\ldots,k\}\setminus\{x_{j'}\}$. Thus for the logistic case, we get
$W(1) = \sum_{i=0}^n \frac{2^i\binom{n}{i}\,2\,(k-1)^{n-i}}{1+e^{-\sigma(i-n)}}, \qquad (8)$
and for the linear case, we get
$W(1) = \sum_{i=0}^n \frac{2^i\binom{n}{i}\,(i+1)\,(k-1)^{n-i}}{n+1} = \frac{(k+1)^n + 2n(k+1)^{n-1}}{n+1}. \qquad (9)$
Thus all three can be efficiently computed exactly. By applying this and substituting Lemma 8's bounds into Theorem 2, we get the following.
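Eq. (9) and its closed form can be checked against brute-force enumeration of all $(k+1)^n$ terms for small $n$ and $k$ (our own check, not the authors' code):

```python
from itertools import product
from math import comb

def W1_linear_closed(n, k):
    """Closed form of Eq. (9)."""
    return ((k + 1) ** n + 2 * n * (k + 1) ** (n - 1)) / (n + 1)

def W1_linear_sum(n, k):
    """Summation form of Eq. (9), grouping terms by match count i."""
    return sum(2 ** i * comb(n, i) * (i + 1) * (k - 1) ** (n - i)
               for i in range(n + 1)) / (n + 1)

def W1_linear_brute(n, k, x):
    """Sum of (1 + m) / (n + 1) over every term in {0,...,k}^n."""
    total = 0.0
    for term in product(range(k + 1), repeat=n):
        m = sum(1 for p, x_i in zip(term, x) if p == 0 or p == x_i)
        total += (1.0 + m) / (n + 1)
    return total

n, k = 4, 3
x = (1, 2, 3, 1)            # any example with entries in {1,...,k}
assert abs(W1_linear_closed(n, k) - W1_linear_sum(n, k)) < 1e-9
assert abs(W1_linear_closed(n, k) - W1_linear_brute(n, k, x)) < 1e-9
```

Note that the brute-force sum is independent of the particular example $\vec x$, which is why the closed form needs no reference to it.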
Corollary 9. When applying Winnow to learn generalized DNF (with $k_i = k$ for all $i$ for the logistic and linear cases), let the sample size be $S = \lceil 130\,r\,e^2/\epsilon^2\rceil$ and let $M$ be simulated long enough for each sample such that the variation distance between the empirical distribution and $\pi_{\alpha_i,t}$ is at most $\epsilon/(5e^2 r)$. Then for any $\delta > 0$, $\hat W(\alpha)$ satisfies
$\Pr[(1-\epsilon)W(\alpha) \le \hat W(\alpha) \le (1+\epsilon)W(\alpha)] \ge 1-\delta.$
As stated in the following corollary, our algorithms' behaviors are the same as Winnow's for a straightforward brute-force implementation if the weighted sums are not too close to $\theta$ for any input. This is true with probability at least $1-\delta_t'$ for each trial $t$, so setting $\delta_t' = \delta/2^t$ yields a total probability of failure of at most$^{14}$ $\sum_{t=1}^\infty \delta/2^t = \delta$. Finally, note that it is easy to extend the corollary to tolerate a bounded number of trials with weighted sums that are near $\theta$ by thinking of such potential mispredictions as noise and applying Theorem 7.
Corollary 10. Using the assumptions of Corollary 9, if $W_t(\alpha) \notin [\theta/(1+\epsilon),\,\theta/(1-\epsilon)]$ for all trials $t$, then with probability at least $1-\delta$, the number of mistakes made by Winnow on any sequence of examples is as bounded by Theorem 7 (see Sections 4.1.1.1–4.1.1.3 and Lemma 11).
A hurdle that must be overcome to get an efficient algorithm is $S$'s polynomial dependence on $1/\epsilon$ in Corollary 9, even though Winnow might at times have $W/\theta$ exponentially close$^{15}$ to 1, requiring exponentially small $\epsilon$. It is open whether this can be addressed in an average-case analysis of Winnow when learning restricted concept classes under specific distributions.

We now explore bounding the mixing times of the Markov chains. Note that the bounds are based on worst-case analyses and assume that the maximum number of weight updates (as bounded by Theorem 7's mistake bound) have been made. Prior to making that number of updates (e.g. near the start of training), the mixing time bounds will be lower, since the stationary distribution $\pi$ of $M$ will be closer to uniform (in fact, before the first update, $\pi$ is uniform). We can get mixing time bounds for these earlier cases by substituting the number of prediction mistakes made so far for the mistake bounds.
4.1.1.1. Binary $p(\vec x)$. It is straightforward to apply Theorem 7 to the binary case. Let the vector $\vec\mu$ be 0 for each irrelevant term and 1 for each relevant term. Then when $c = 1$, at least one relevant term must be satisfied, so $\vec\mu\cdot\vec Y \ge 1$. Further, if $c = 0$, then no relevant terms are satisfied and $\vec\mu\cdot\vec Y = 0 \le 1-\rho$ for $\rho = 1$. Assuming all examples are noise-free, applying Theorem 7 yields a mistake bound of $|M| \le 8 + 14K\ln N$. So if $k \ge k_i$ for all $i$, then using the at most $(k+1)^n$ possible terms as Winnow's inputs, it can learn $K$-term generalized DNF with at most $8 + 14Kn\ln(k+1)$ prediction mistakes.

Unfortunately, with the binary case it is very difficult to find non-trivial bounds on $g$ and $h$ from Theorem 5, due to the discontinuity of $p(\vec x)$. Bounding both $g$ and $h$ requires bounding the ratios of the weights of nodes in $\Omega$. For binary $p(\vec x)$, these weights directly depend on how often the nodes predicted 1 when a prediction mistake was made, but it is difficult to relate how often this occurs for a node $\vec p$ to how often this occurs for another node $\vec q$, even if $\vec p$ and $\vec q$ are neighbors. On the other hand, when we consider logistic and linear $p(\vec x)$, we can relate the nodes' weights and get non-trivial bounds on $g$ and $h$.
4.1.1.2. Logistic $p(\vec x)$. The mistake bound of this application of Winnow is similar to that of the straightforward version with binary inputs.
[Footnote 14: Recall from the proof of Theorem 2 that only $O(\log 1/\delta')$ runs of the estimation procedure are needed to reduce the probability of failure to $\delta'$.]

[Footnote 15: The potential problem of $W_t/\theta = 1$ can be avoided by using a threshold of $\theta + \alpha^{-(|M|+1)}$, where $|M|$ is the mistake bound from applying Theorem 7. Obviously $W$ can never equal this new threshold.]
Lemma 11. When using Eq. (6) with $\sigma = \ln(60Kn\ln k)$ to specify the inputs, where $k = \max_i\{k_i\}$, the number of prediction mistakes made by Winnow when learning DNF is at most
$8.88 + 15.54\,Kn\ln k.$
Proof. We start by finding $\vec\mu$ and $\rho$ that satisfy the conditions of Theorem 7. For each of the $K$ relevant terms (Winnow inputs), set the corresponding value in $\vec\mu$ to a constant $\mu$, which we will define later. Set all other values in $\vec\mu$ to 0. In the worst case, when an example $\vec x$ is positive, it satisfies exactly one relevant term $\vec p$ and does not at all satisfy any of the other relevant terms. Then $p(\vec x) = 1$ and $q(\vec x) = 2/(1+e^{\sigma n})$ for all other relevant terms $\vec q$. Thus it suffices to set $\mu$ such that
$\mu + \frac{2(K-1)\mu}{1+e^{\sigma n}} \ge 1.$
After some algebra we see that it suffices to set $\mu = (1+e^{\sigma n})/(2K+e^{\sigma n}-1)$.

Now we find $\rho$. In the worst case, for a negative example $\vec x$, we will have each relevant term $\vec p$ almost fully satisfied, i.e. $m_{\vec p} = n-1$. Hence $p(\vec x) = 2/(1+e^{\sigma})$. So we need $\rho$ such that $1-\rho \ge 2K\mu/(1+e^{\sigma})$. Substituting $\mu$ yields
$\rho \le 1 - \frac{2K(1+e^{\sigma n})}{2K(1+e^{\sigma}) + e^{\sigma n}(1+e^{\sigma}) - e^{\sigma} - 1}.$
This expression decreases with increasing $n$, so to find an appropriate $\rho$, it suffices to take its limit as $n\to\infty$. Applying l'Hôpital's rule shows that $\rho = 1 - 2K/(1+e^{\sigma})$ is sufficient, which is positive so long as $\sigma > \ln(2K-1)$. We assume that all examples are noise-free, so $\vec E_j = \vec 0$ for all $j$. Now applying Theorem 7 yields a mistake bound of
$\left(\frac{(1+e^{\sigma})^2}{(1+e^{\sigma})^2 - 4K(e^{\sigma}+1-K)}\right)\left(8 + 14K\,\frac{1+e^{\sigma n}}{2K+e^{\sigma n}-1}\left(\ln\frac{1+e^{\sigma n}}{2K+e^{\sigma n}-1} + \ln N\right)\right)$
$\le \left(\frac{(1+e^{\sigma})^2}{(1+e^{\sigma})^2 - 4K(1+e^{\sigma})}\right)(8+14K\ln N) = \left(\frac{1+e^{\sigma}}{1+e^{\sigma}-4K}\right)(8+14K\ln N) \le 1.11\,(8+14K\ln N),$
since $K, n \ge 1$ and $k \ge 2$. Noting that $N \le k^n$ completes the proof. $\square$
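The choices of $\mu$ and $\rho$ in the proof can be verified numerically against Theorem 7's two conditions in the worst cases described above. This is our own check; `check_lemma11` is a hypothetical helper name:

```python
import math

def check_lemma11(K, n, k):
    sigma = math.log(60 * K * n * math.log(k))
    mu = (1 + math.exp(sigma * n)) / (2 * K + math.exp(sigma * n) - 1)
    rho = 1 - 2 * K / (1 + math.exp(sigma))
    # Positive worst case: one relevant term fully satisfied (input 1),
    # the other K - 1 relevant terms not satisfied at all (m = 0).
    pos = mu * (1 + (K - 1) * 2 / (1 + math.exp(sigma * n)))
    # Negative worst case: every relevant term has m = n - 1.
    neg = mu * K * 2 / (1 + math.exp(sigma))
    assert rho > 0                        # needs sigma > ln(2K - 1)
    assert pos >= 1 - 1e-12               # Theorem 7 condition for c = 1
    assert neg <= 1 - rho + 1e-12         # Theorem 7 condition for c = 0
    return mu, rho

for K, n, k in [(1, 5, 2), (3, 10, 4), (10, 20, 8)]:
    check_lemma11(K, n, k)
```

The positive-case condition holds with equality by construction of $\mu$, and the negative case holds because $\mu \le 1$ for $K \ge 1$.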
We now work towards a mixing time bound for the chain.

Lemma 12. Let $\Omega = \{0,\ldots,k\}^n$. Then for all $\vec p,\vec q\in\Omega$,
$\pi_\alpha(\vec p)\,\pi_\alpha(\vec q) \le 4\,\alpha^{|M|}\,\pi_\alpha(\vec a\,')\,\pi_\alpha(\vec\eta_e(\vec p,\vec q)).$
Proof. Let $\vec q_{1\ldots i}$ denote $(q_1,\ldots,q_i)$, and similarly for $\vec p_{1\ldots i}$. Then $m_{\vec p} = m_{\vec p_{1\ldots i}} + m_{\vec p_{i+1\ldots n}}$, $m_{\vec q} = m_{\vec q_{1\ldots i}} + m_{\vec q_{i+1\ldots n}}$, $m_{\vec\eta_e(\vec p,\vec q)} = m_{\vec p_{1\ldots i}} + m_{\vec q_{i+1\ldots n}}$, and $m_{\vec a\,'} = m_{\vec q_{1\ldots i}} + m_{\vec p_{i+1\ldots n}}$. Further, all four of these values are in $\{0,\ldots,n\}$. This yields
$\frac{\pi_\alpha(\vec p)\,\pi_\alpha(\vec q)}{\pi_\alpha(\vec a\,')\,\pi_\alpha(\vec\eta_e(\vec p,\vec q))} = \frac{p(\vec x)\,q(\vec x)\,\alpha^{z_{\vec p}+z_{\vec q}}}{\vec\eta_e(\vec p,\vec q)(\vec x)\,\vec a\,'(\vec x)\,\alpha^{z_{\vec\eta_e(\vec p,\vec q)}+z_{\vec a\,'}}} = \frac{\alpha^{z_{\vec p}+z_{\vec q}}}{\alpha^{z_{\vec\eta_e(\vec p,\vec q)}+z_{\vec a\,'}}}\left(\frac{1+U(\vec p_{1\ldots i},\vec q_{i+1\ldots n},n)+U(\vec q_{1\ldots i},\vec p_{i+1\ldots n},n)+U(\vec p_{1\ldots n},\vec q_{1\ldots n},2n)}{1+U(\vec p_{1\ldots i},\vec p_{i+1\ldots n},n)+U(\vec q_{1\ldots i},\vec q_{i+1\ldots n},n)+U(\vec p_{1\ldots n},\vec q_{1\ldots n},2n)}\right) < 4\,\alpha^{z_{\vec p}+z_{\vec q}-z_{\vec\eta_e(\vec p,\vec q)}-z_{\vec a\,'}},$
where $U(\vec p_{i\ldots j},\vec q_{i'\ldots j'},\nu) = \exp(-\sigma(m_{\vec p_{i\ldots j}} + m_{\vec q_{i'\ldots j'}} - \nu))$. The last inequality follows from each term in the numerator being strictly less than the entire denominator. Now let $C$ be the exponent of the $\alpha$ term. Then we have$^{16}$
$C = z_{\vec p} + z_{\vec q} - z_{\vec\eta_e(\vec p,\vec q)} - z_{\vec a\,'} = \sum_{\vec x\in M}\left(\frac{2c_{\vec x}}{1+e^{-\sigma(m_{\vec x,\vec p_{1\ldots i}}+m_{\vec x,\vec p_{i+1\ldots n}}-n)}} + \frac{2c_{\vec x}}{1+e^{-\sigma(m_{\vec x,\vec q_{1\ldots i}}+m_{\vec x,\vec q_{i+1\ldots n}}-n)}} - \frac{2c_{\vec x}}{1+e^{-\sigma(m_{\vec x,\vec p_{1\ldots i}}+m_{\vec x,\vec q_{i+1\ldots n}}-n)}} - \frac{2c_{\vec x}}{1+e^{-\sigma(m_{\vec x,\vec q_{1\ldots i}}+m_{\vec x,\vec p_{i+1\ldots n}}-n)}}\right).$
Each term of the above summation is between $-1$ and $1$, so a worst-case upper bound is $|M|$. $\square$
Lemma 13. For all neighbors $\vec p$ and $\vec q\in\Omega$,
$\max\{\pi_\alpha(\vec p)/\pi_\alpha(\vec q),\ \pi_\alpha(\vec q)/\pi_\alpha(\vec p)\} \le \alpha^{|M|}\cdot 60Kn\ln k.$
Proof. Since $\vec p$ and $\vec q$ are neighbors, they differ in only one position, so $|m_{\vec p}-m_{\vec q}| \le 1$. Then
$\frac{\pi_\alpha(\vec p)}{\pi_\alpha(\vec q)} = \left(\frac{p(\vec x)}{q(\vec x)}\right)\alpha^{z_{\vec p}-z_{\vec q}} = \left(\frac{1+e^{\sigma n-\sigma m_{\vec q}}}{1+e^{\sigma n-\sigma m_{\vec p}}}\right)\alpha^{z_{\vec p}-z_{\vec q}} \le \left(\frac{1+e^{\sigma n-\sigma(m_{\vec p}-1)}}{1+e^{\sigma n-\sigma m_{\vec p}}}\right)\alpha^{z_{\vec p}-z_{\vec q}} = \left(\frac{1+e^{\sigma}\,e^{\sigma n-\sigma m_{\vec p}}}{1+e^{\sigma n-\sigma m_{\vec p}}}\right)\alpha^{z_{\vec p}-z_{\vec q}} \le \left(\frac{1+e^{\sigma n-\sigma m_{\vec p}}}{1+e^{\sigma n-\sigma m_{\vec p}}}\right)e^{\sigma}\,\alpha^{z_{\vec p}-z_{\vec q}} = e^{\sigma}\,\alpha^{z_{\vec p}-z_{\vec q}}.$

[Footnote 16: When the subscript $\vec x$ is omitted from $m$, then $m$ counts the number of matches with the current example. In the summations over $\vec x\in M$, $m_{\vec x}$ represents the number of matches with example $\vec x$.]

We now consider $z_{\vec p}-z_{\vec q}$, which equals
$\sum_{\vec x\in M}\left(\frac{2c_{\vec x}}{1+e^{\sigma n-\sigma m_{\vec x,\vec p}}} - \frac{2c_{\vec x}}{1+e^{\sigma n-\sigma m_{\vec x,\vec q}}}\right) = 2\sum_{\vec x\in M} c_{\vec x}\left(\frac{e^{\sigma n-\sigma m_{\vec x,\vec q}} - e^{\sigma n-\sigma m_{\vec x,\vec p}}}{1+e^{\sigma n-\sigma m_{\vec x,\vec q}}+e^{\sigma n-\sigma m_{\vec x,\vec p}}+e^{2\sigma n-\sigma(m_{\vec x,\vec p}+m_{\vec x,\vec q})}}\right)$
$\le 2\sum_{\vec x\in M} c_{\vec x}\left(\frac{e^{\sigma n-\sigma m_{\vec x,\vec p}+\sigma} - e^{\sigma n-\sigma m_{\vec x,\vec p}}}{1+e^{\sigma n-\sigma m_{\vec x,\vec p}+\sigma}+e^{\sigma n-\sigma m_{\vec x,\vec p}}+e^{2\sigma n-\sigma(m_{\vec x,\vec p}-1+m_{\vec x,\vec p})}}\right) = 2\sum_{\vec x\in M} c_{\vec x}\left(\frac{e^{\sigma n-\sigma m_{\vec x,\vec p}}(e^{\sigma}-1)}{1+e^{\sigma n-\sigma m_{\vec x,\vec p}}(e^{\sigma}+1)+e^{\sigma}\,e^{2\sigma(n-m_{\vec x,\vec p})}}\right) \le 2\sum_{\vec x\in M}\frac{c_{\vec x}}{1+e^{\sigma(n-m_{\vec x,\vec p})}} \le |M|,$
where the inequality in the second displayed line follows from the fact that there are only three ways to relate $m_{\vec x,\vec p}$ and $m_{\vec x,\vec q}$ for a specific $\vec x$: if $m_{\vec x,\vec p} = m_{\vec x,\vec q}+1$, then that term of the summation equals the bound; if $m_{\vec x,\vec p} = m_{\vec x,\vec q}-1$, then that term of the summation is negative, which is less than the bound; and if $m_{\vec x,\vec p} = m_{\vec x,\vec q}$, then that term of the summation is 0, which is less than the bound.

Finally, we note that a symmetric argument can be made for $\pi_\alpha(\vec q)/\pi_\alpha(\vec p)$. $\square$
We now apply Theorem 5 to bound the mixing time of this Markov chain.

Corollary 14. When learning generalized DNF using Winnow and a logistic $p(\vec x)$, a simulation of $M$ that starts at any node and is of length
$T_i = 480\,kn^3\,\alpha_i^{1+2|M|}\,K\ln k\,\big(n\ln k + 2|M|\ln\alpha_i + \ln(1/\epsilon')\big)$
(where $|M|$ is the number of prediction mistakes made so far) will draw samples from $\hat\pi_{\alpha_i}$ such that $\|\hat\pi_{\alpha_i}-\pi_{\alpha_i}\| \le \epsilon'$.

Proof. Lemmas 12 and 13 bound $g$ and $h$, which we substitute directly into Theorem 5. Also note that $W(\alpha_i) \le k^n\alpha_i^{|M|}$ and $w_{\alpha_i,\vec p} \ge \alpha_i^{-|M|}$, completing the proof. $\square$
Before many prediction mistakes have been made, our algorithm can quickly generate random samples from $M$ almost according to its stationary distribution $\pi$. Unfortunately, our worst-case mixing time bound for this chain (once $|M|$ approaches Winnow's mistake bound) is exponential in $n$ and $K$. Indeed, a straightforward brute-force computation of the sum of the weights can be done in $\Theta(k^n)$ time, whereas our bound on $T_i$ grows with $\alpha_i^{1+2(8.88+15.54Kn\ln k)} \ge k^{31Kn\ln\alpha_i} \ge k^{12.5n}$ when $\alpha_i = \alpha = 3/2$ (a popular value for $\alpha$). However, Corollary 14 is based on worst-case, adversary-based analyses. In particular, the proofs of Lemmas 12 and 13 both bound the exponents of the $\alpha$ terms with $|M|$, even though they could be much smaller. (For example, in Lemma 13's proof, the terms in the summation of $z_{\vec p}-z_{\vec q}$ are exponentially small when $m_{\vec x,\vec p}$ is small, which can occur frequently for terms $\vec p$ with few zeroes. Also, we assumed that each term of the summation was positive, even though several could be negative.) It is open whether sub-exponential bounds can be achieved by applying a different analysis to some special cases of restricted concept classes and distributional assumptions. Further, in Section 5 we show that in practice our algorithm performs much better than the worst-case theoretical results imply, especially considering that highly accurate estimates of the weighted sums are not needed so long as we know which side of the threshold the sum lies on.
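The gap described here is easy to tabulate. The sketch below is our own illustration (the function names are ours; it fixes $\epsilon' = 0.1$ and plugs Lemma 11's mistake bound into Corollary 14's formula) and shows the worst-case bound dwarfing the $\Theta(k^n)$ brute-force cost even for small $n$:

```python
import math

def brute_force_cost(n, k):
    """Enumerate every term exactly: Theta(k^n)."""
    return k ** n

def corollary14_bound(n, k, K, eps_prime=0.1, alpha=1.5):
    """Worst-case simulation length T_i from Corollary 14, with |M| set to
    Lemma 11's mistake bound (the worst case discussed in the text)."""
    M = 8.88 + 15.54 * K * n * math.log(k)
    return (480 * k * n ** 3 * alpha ** (1 + 2 * M) * K * math.log(k)
            * (n * math.log(k) + 2 * M * math.log(alpha)
               + math.log(1 / eps_prime)))

for n in (5, 10):
    bf, mc = brute_force_cost(n, 2), corollary14_bound(n, 2, 1)
    # The worst-case MCMC bound far exceeds brute force at these sizes.
    assert mc > bf
```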
4.1.1.3. Linear $p(\vec x)$. Applying Theorem 7 to the linear case is not as straightforward as it was for the logistic case. As in the proof of Lemma 11, we set the entries in $\vec\mu$ corresponding to relevant terms to $\mu$ and the remaining entries to 0. When $c = 1$, in the worst case exactly one relevant term matches all $n$ variables in $\vec x$ and the remaining $K-1$ relevant terms match 0. Then we get
$\vec\mu\cdot\vec Y = \mu\left(1+\frac{K-1}{n+1}\right) = \mu\left(\frac{n+K}{n+1}\right),$
which has to be $\ge 1$. When $c = 0$, the worst case has all $K$ relevant terms matching $n-1$ variables of $\vec x$, yielding
$\vec\mu\cdot\vec Y = \mu\left(\frac{nK}{n+1}\right) \ge \mu\left(\frac{n+K}{n+1}\right)$
for $n, K \ge 2$. Thus it is impossible to get a $\rho > 0$ for the linear case unless we treat such worst-case examples as noise and use a non-zero $\vec E$. Following this idea, we assume that when $c = 1$, all relevant inputs to Winnow are at least $g_1/(n+1)$ (i.e. all relevant inputs match at least $g_1-1$ variables in $\vec x$), with of course at least one such input $= 1$. Further, we assume that when $c = 0$, all relevant inputs are at most $g_0/(n+1)$. Then it is easy to show that setting
$\mu = \frac{n+1}{n+1+g_1(K-1)}$
and
$\rho = \frac{n+1+g_1(K-1)-g_0 K}{n+1+g_1(K-1)}$
satisfies the conditions of Theorem 7. For example, if $g_1 = 3n/4$ and $g_0 = n/2$, then $\mu = (4n+4)/(n+3nK+4)$, $\rho = (n+nK+4)/(n+3nK+4) \ge 1/3$, and Theorem 7's mistake bound is
$|M| \le 72 + \frac{504K(n+1)}{n+3Kn+4}\,\ln\left(\frac{4N(n+1)}{n+3Kn+4}\right) + 36\sum_{j=1}^t \vec\mu\cdot\vec E_j \le 72 + 252\ln N + 36\sum_{j=1}^t \vec\mu\cdot\vec E_j \le 72 + 252\,n\ln k + 36\sum_{j=1}^t \vec\mu\cdot\vec E_j, \qquad (10)$
if $n \ge 2$. Of course, this is of little value as an adversarial bound, since the third term is summed over all trials and an adversary could make each term of this summation positive. But a bound of this form might be useful under appropriate distributional assumptions.

We now bound the mixing time for the linear case.
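Under the stated margin assumptions, the chosen $\mu$ and $\rho$ again satisfy Theorem 7's conditions, which can be confirmed numerically for the $g_1 = 3n/4$, $g_0 = n/2$ example (our own check; `check_linear_margins` is a hypothetical helper name):

```python
def check_linear_margins(n, K):
    g1, g0 = 3 * n / 4, n / 2
    mu = (n + 1) / (n + 1 + g1 * (K - 1))
    rho = (n + 1 + g1 * (K - 1) - g0 * K) / (n + 1 + g1 * (K - 1))
    # c = 1 worst case under the assumption: one relevant input equals 1,
    # the other K - 1 relevant inputs at their minimum g1 / (n + 1).
    pos = mu * (1 + (K - 1) * g1 / (n + 1))
    # c = 0 worst case: all K relevant inputs at their maximum g0 / (n + 1).
    neg = mu * K * g0 / (n + 1)
    assert pos >= 1 - 1e-12               # Theorem 7 condition for c = 1
    assert neg <= 1 - rho + 1e-12         # Theorem 7 condition for c = 0
    assert rho >= 1 / 3 - 1e-12           # the bound used to get Eq. (10)
    return mu, rho

for n, K in [(4, 2), (8, 3), (40, 10)]:
    check_linear_margins(n, K)
```

As in Lemma 11, the positive-case condition holds with equality by construction of $\mu$, which is why $\mu$ takes exactly this form.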
Lemma 15. Let $\Omega = \{0,\ldots,k\}^n$. Then for all $\vec p,\vec q\in\Omega$,
$\pi_\alpha(\vec p)\,\pi_\alpha(\vec q) \le \frac{(1+n/2)^2}{n+1}\,\pi_\alpha(\vec a\,')\,\pi_\alpha(\vec\eta_e(\vec p,\vec q)).$
Proof. Using the same notation introduced in the first paragraph of Lemma 12's proof, we get
$\frac{\pi_\alpha(\vec p)\,\pi_\alpha(\vec q)}{\pi_\alpha(\vec a\,')\,\pi_\alpha(\vec\eta_e(\vec p,\vec q))} = \frac{p(\vec x)\,q(\vec x)\,\alpha^{z_{\vec p}+z_{\vec q}}}{\vec\eta_e(\vec p,\vec q)(\vec x)\,\vec a\,'(\vec x)\,\alpha^{z_{\vec\eta_e(\vec p,\vec q)}+z_{\vec a\,'}}} = \frac{\alpha^{z_{\vec p}+z_{\vec q}}}{\alpha^{z_{\vec\eta_e(\vec p,\vec q)}+z_{\vec a\,'}}}\left(\frac{(1+m_{\vec p_{1\ldots i}}+m_{\vec p_{i+1\ldots n}})(1+m_{\vec q_{1\ldots i}}+m_{\vec q_{i+1\ldots n}})}{(1+m_{\vec p_{1\ldots i}}+m_{\vec q_{i+1\ldots n}})(1+m_{\vec q_{1\ldots i}}+m_{\vec p_{i+1\ldots n}})}\right) \le \alpha^{z_{\vec p}+z_{\vec q}-z_{\vec\eta_e(\vec p,\vec q)}-z_{\vec a\,'}}\left(\frac{(1+n/2)^2}{n+1}\right).$
The last inequality holds since the second term is maximized by setting $m_{\vec p_{1\ldots i}} = m_{\vec q_{i+1\ldots n}} = 0$ and $m_{\vec q_{1\ldots i}} = m_{\vec p_{i+1\ldots n}} = n/2$. Now let $C$ be the first term's exponent:
$C = z_{\vec p}+z_{\vec q}-z_{\vec\eta_e(\vec p,\vec q)}-z_{\vec a\,'} = \sum_{\vec x\in M} c_{\vec x}\left(\frac{(1+m_{\vec x,\vec p_{1\ldots i}}+m_{\vec x,\vec p_{i+1\ldots n}})+(1+m_{\vec x,\vec q_{1\ldots i}}+m_{\vec x,\vec q_{i+1\ldots n}})}{n+1} - \frac{(1+m_{\vec x,\vec p_{1\ldots i}}+m_{\vec x,\vec q_{i+1\ldots n}})+(1+m_{\vec x,\vec q_{1\ldots i}}+m_{\vec x,\vec p_{i+1\ldots n}})}{n+1}\right) = 0.\ \square$
Lemma 16. For all neighbors $\vec p$ and $\vec q\in\Omega$,
$\max\{\pi_\alpha(\vec p)/\pi_\alpha(\vec q),\ \pi_\alpha(\vec q)/\pi_\alpha(\vec p)\} \le 2\,\alpha^{|M|/(n+1)}.$
Proof. Since $\vec p$ and $\vec q$ are neighbors, they differ in only one position, so $|m_{\vec p}-m_{\vec q}| \le 1$. Then
$\frac{\pi_\alpha(\vec p)}{\pi_\alpha(\vec q)} = \left(\frac{p(\vec x)}{q(\vec x)}\right)\alpha^{z_{\vec p}-z_{\vec q}} = \left(\frac{1+m_{\vec p}}{1+m_{\vec q}}\right)\alpha^{z_{\vec p}-z_{\vec q}} \le \left(\frac{2+m_{\vec q}}{1+m_{\vec q}}\right)\alpha^{z_{\vec p}-z_{\vec q}} \le 2\,\alpha^{z_{\vec p}-z_{\vec q}}.$
We now consider $z_{\vec p}-z_{\vec q}$:
$z_{\vec p}-z_{\vec q} = \frac{1}{n+1}\sum_{\vec x\in M} c_{\vec x}\,(1+m_{\vec x,\vec p}-1-m_{\vec x,\vec q}) \le |M|/(n+1).$
Finally, we note that a symmetric argument can be made for $\pi_\alpha(\vec q)/\pi_\alpha(\vec p)$. $\square$
Corollary 17. When learning generalized DNF using Winnow and a linear $p(\vec x)$, a simulation of $M$ that starts at any node and is of length
$T_i = 4kn(n^2/4+n+1)\,\alpha_i^{|M|/(n+1)}\,\big(n\ln k + 2|M|\ln\alpha_i + \ln(1/\epsilon')\big)$
will draw samples from $\hat\pi_{\alpha_i}$ such that $\|\hat\pi_{\alpha_i}-\pi_{\alpha_i}\| \le \epsilon'$.

Proof. Lemmas 15 and 16 bound $g$ and $h$, which we substitute directly into Theorem 5. Also note that $W(\alpha_i) \le k^n\alpha_i^{|M|}$ and $w_{\alpha_i,\vec p} \ge \alpha_i^{-|M|}$, completing the proof. $\square$

Note that if $|M| = O(n\log K)$ (either if Winnow is in the early stages of learning or if the third term of Eq. (10) is $O(n\log K)$), then the chain's mixing time is polynomial in all relevant parameters if $k$ (the number of values each variable can take on) is a constant, yielding an FPRAS under the conditions of Corollary 9.
4.1.2. Perceptron

We now consider applying our technique to Rosenblatt's Perceptron [33] algorithm. The purpose of our analysis is primarily for contrast with the Winnow case, since$^{17}$ Khardon et al. [18] give a kernel function to efficiently and exactly compute the weighted sums when applying Perceptron to learning DNF. But they also give an exponential lower bound on the number of mistakes that kernel perceptron makes in learning DNF: $2^{\Omega(n)}$. Thus their results do not imply an efficient DNF-learning algorithm.

We refer the reader to Section 2.1 for an overview of the Perceptron algorithm. Recall from Section 3 that if the initial weight vector is the all-0s vector, then term $\vec p$'s weight is $w_{\vec p} = a\,z_{\vec p}$, where $z_{\vec p} = \sum_{\vec x\in M} c_{\vec x}\,p(\vec x)$, $M$ is the set of examples for which a prediction mistake is made, and $c_{\vec x}\in\{-1,+1\}$ is example $\vec x$'s label.

Since our technique is only capable of estimating positive functions, we cannot allow the Perceptron's weights to be negative. Thus to each weight in "standard" Perceptron we add a positive constant $c$, yielding a new weight for term $\vec p$ of $w_{\vec p} = c + a\,z_{\vec p}$, where $c = 3a|M| \ge 3a\max_{\vec p\in\Omega}\{|z_{\vec p}|\}$. The dot product of this new weight vector with the Perceptron inputs will then be compared to the new threshold $c$.

We use the same definitions and algorithm (given in Section 3) that were used for Winnow, except the "base" value of $a_i$ is now $a_0 = 0$ rather than $a_1 = 1$. So our weight estimate is the same as given in Eq. (1), but the product runs from $i = 1$ to $r$ rather than starting at $i = 2$, and we multiply it by $W(a_0)$. As with Winnow, this latter quantity is easily computed exactly: for binary $p(\vec x)$, $W(a_0) = c\,2^n$.
For logistic $p(\vec x)$, if $k_i = k$ for all $i$, we get
$W(0) = c\sum_{i=0}^n \frac{2^i\binom{n}{i}\,2\,(k-1)^{n-i}}{1+e^{-\sigma(i-n)}},$
and for the linear case, we get
$W(0) = c\sum_{i=0}^n \frac{2^i\binom{n}{i}\,(i+1)\,(k-1)^{n-i}}{n+1} = \frac{c\,(k+1)^n + 2cn\,(k+1)^{n-1}}{n+1}.$
Thus all three can be efficiently computed exactly.

[Footnote 17: Note, however, that other applications of Perceptron for which no kernels are available might be amenable to an MCMC-based approach to estimate the dot products.]
We now make the following argument about $f_i$ and $\phi_i$ for all three versions of $p(\vec x)$.

Lemma 18. When applying binary, logistic, or linear Perceptron to learn DNF, for all $i$, $2/3 \le f_i, \phi_i \le 2$.

Proof. We focus on the actual random variables, since the estimates (the "hat" variables) have the same range. Since these variables for all three versions have $p(\vec x)$ in the numerator and denominator, they all equal
$f_i(\vec p) = \frac{c + z_{\vec p}\,a_{i-1}}{c + z_{\vec p}\,a_i}.$
If $z_{\vec p} \ge 0$, then (since $a_{i-1} < a_i$) obviously $f_i(\vec p) \le 1$. Also, since $c \ge 2a\,z_{\vec p}$ and $a \ge a_i$,
$f_i(\vec p) \ge \frac{c + z_{\vec p}\,a_{i-1}}{c + c\,a_i/(2a)} \ge \frac{c + z_{\vec p}\,a_{i-1}}{3c/2} = \frac{2}{3}\left(1+\frac{z_{\vec p}\,a_{i-1}}{c}\right) \ge 2/3.$
If $z_{\vec p} < 0$, then obviously $f_i(\vec p) > 1$. Also,
$f_i(\vec p) \le \frac{c + z_{\vec p}\,a_{i-1}}{c/2} = 2 + \frac{2 z_{\vec p}\,a_{i-1}}{c} \le 2.$
Thus $f_i(\vec p)\in[2/3,2]$ for all $i$. $\square$
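Lemma 18's range can be confirmed numerically. The sketch below is our own (the increasing schedule $0 = a_0 < \cdots < a_r = a$ is an assumption for illustration); it draws random $z_{\vec p}$ with $|z_{\vec p}| \le |M|$, sets $c = 3a|M|$ as in the text, and checks every consecutive ratio:

```python
import random

random.seed(1)
for _ in range(500):
    a = random.uniform(0.5, 3.0)                 # final (largest) a_r
    r = random.randint(2, 20)
    sched = sorted(random.uniform(0, a) for _ in range(r - 1))
    sched = [0.0] + sched + [a]                  # a_0 = 0, ..., a_r = a
    M = random.randint(1, 30)                    # |M|, the mistake count
    z_p = random.uniform(-M, M)                  # |z_p| <= |M|
    c = 3 * a * M                                # the shifted-weight constant
    for i in range(1, len(sched)):
        f = (c + z_p * sched[i - 1]) / (c + z_p * sched[i])
        assert 2 / 3 <= f <= 2, (f, z_p, i)
```

In fact the observed ratios stay in the tighter interval $[3/4, 3/2]$ for $c = 3a|M|$; the lemma's $[2/3, 2]$ uses only the weaker assumption $c \ge 2a|z_{\vec p}|$.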
This leads to the following corollary.

Corollary 19. When applying Perceptron to learn generalized DNF (with $k_i = k$ for all $i$ for the logistic and linear cases), let the sample size be $S = \lceil 390\,r/\epsilon^2\rceil$ and let $M$ be simulated long enough for each sample such that the variation distance between the empirical distribution and $\pi_{a_i}$ is at most $\epsilon/(15r)$. Then for any $\delta > 0$, $\hat W(a)$ satisfies
$\Pr[(1-\epsilon)W(a) \le \hat W(a) \le (1+\epsilon)W(a)] \ge 1-\delta.$
To bound the number of prediction mistakes the Perceptron algorithm makes in learning DNF, we apply a result from Gentile and Warmuth [12].

Theorem 20 (Gentile and Warmuth [12]). Let $(\vec Y_j, c_j)\in[0,1]^N\times\{0,1\}$ for $j = 1,\ldots,t$, let $\vec\mu$ be an arbitrary weight vector of dimension $N$, and let $M$ be the set of examples on which the Perceptron algorithm makes a prediction mistake. Then the number of mistakes made by the Perceptron algorithm is
$|M| \le \left(\frac{\|\vec\mu\|_2\,r}{\gamma_{\vec\mu,M}}\right)^2,$
where $\|\cdot\|_2$ is the 2-norm, $r \ge \|\vec Y\|_2$ for all $\vec Y\in M$, and
$\gamma_{\vec\mu,M} = \frac{1}{|M|}\sum_{\vec Y_j\in M} c_j\,\vec\mu\cdot\vec Y_j$
is the average margin of $\vec\mu$.
4.1.2.1. Binary $p(\vec x)$. Applying Theorem 20 to the binary case is straightforward. We let $\vec\mu$ be 0 in all places except those corresponding to a relevant attribute, which are set to 1 (so there are $K$ 1s in $\vec\mu$). In addition, we add an $(N+1)$th position to $\vec\mu$, setting it to $-1/2$. This position will correspond to a 1 added to each example seen by the perceptron. Thus we get $\|\vec\mu\|_2 = \sqrt{K+1/4}$. Since all Perceptron inputs are from $\{0,1\}$ for the binary case, we get $\gamma_{\vec\mu,M} \ge 1/2$. Further, since exactly $2^n+1$ inputs are 1 for each example, $r = \sqrt{2^n+1}$. Applying Theorem 20 yields $|M| \le (4K+1)(2^n+1)$.

As with binary $p(\vec x)$ with Winnow, binary $p(\vec x)$ with Perceptron is difficult to analyze to provide non-trivial bounds on the mixing time. Thus we look at the logistic and linear cases.
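Theorem 20 applied to the binary case reduces to simple arithmetic; the following check (our own, with hypothetical function names) reproduces $|M| \le (4K+1)(2^n+1)$:

```python
import math

def perceptron_mistake_bound(mu_norm_sq, r, gamma):
    """Theorem 20: |M| <= (||mu||_2 * r / gamma)^2."""
    return mu_norm_sq * r ** 2 / gamma ** 2

def binary_case_bound(K, n):
    mu_norm_sq = K + 0.25          # K ones plus a -1/2 bias entry
    r = math.sqrt(2 ** n + 1)      # 2^n satisfied terms plus the bias input
    gamma = 0.5                    # average margin for binary inputs
    return perceptron_mistake_bound(mu_norm_sq, r, gamma)

for K, n in [(1, 3), (4, 6), (10, 10)]:
    assert abs(binary_case_bound(K, n) - (4 * K + 1) * (2 ** n + 1)) < 1e-6
```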
4.1.2.2. Logistic $p(\vec x)$. We begin by bounding the number of mistakes logistic Perceptron will make.

Lemma 21. When using Eq. (6) with $\sigma = \ln(60Kn\ln k)$ to specify the inputs, the number of prediction mistakes made by Perceptron when learning DNF is at most
$(20K+5)\left(2 + k/(3600\ln^2 k)\right)^n.$
Proof. We use the same $\vec\mu$ as we did for the binary case, namely 1s at the $K$ relevant positions, $-1/2$ matching the extra 1 added to each example, and 0s elsewhere. We now bound $\gamma_{\vec\mu,M}$. If $c_j = +1$ for some trial $j$, then at least one of the relevant terms $\vec p$ must send a 1 input to Perceptron. Thus $c_j\,\vec\mu\cdot\vec Y_j \ge 1 - 1/2 = 1/2$, where $\vec Y_j$ is the vector of inputs to Perceptron (i.e. the outputs of the $p(\cdot)$ functions). If $c_j = -1$, then in the worst case each relevant term will be almost completely satisfied by the input example $\vec x_j$, i.e. all but one variable in each relevant term will be satisfied. If this happens, then the total contribution to $\vec\mu\cdot\vec Y_j$ that comes from the $K$ relevant terms is $2K/(1+e^{\sigma})$. Adding this to the extra $-1/2$ and multiplying by $c_j = -1$ yields a worst-case bound of
$\gamma_{\vec\mu,M} \ge c_j\,\vec\mu\cdot\vec Y_j \ge \frac12 - \frac{2K}{1+e^{\sigma}} = \frac12 - \frac{2K}{1+60Kn\ln k} \ge \frac12 - \frac{1}{30\ln 2} \ge 0.45,$
since $n \ge 1$ and $k \ge 2$.

We now bound $r$. Recall that Eq. (8) sums the $p(\vec x)$ values for the entire set of terms. By substituting $(p(\vec x))^2$ for $p(\vec x)$ in this equation and taking the square root, we exactly get the 2-norm of any input to Perceptron:
$\|\vec Y\|_2 = \sqrt{\sum_{i=0}^n \frac{2^i\binom{n}{i}\,4\,(k-1)^{n-i}}{(1+e^{-\sigma(i-n)})^2}} \le \sqrt{\sum_{i=0}^n \frac{2^i\binom{n}{i}\,4\,k^{n-i}}{e^{-2\sigma(i-n)}}} = 2\sqrt{\sum_{i=0}^n 2^i\binom{n}{i}\,k^{n-i}\,(60Kn\ln k)^{2i-2n}}$
$= \frac{2k^{n/2}}{(60Kn\ln k)^n}\sqrt{\sum_{i=0}^n \big((7200/k)\,n^2K^2\ln^2 k\big)^i\binom{n}{i}} = \sqrt{4\left(\frac{(7200/k)\,n^2K^2\ln^2 k + 1}{(3600/k)\,n^2K^2\ln^2 k}\right)^n} < 2\left(2 + k/(3600\ln^2 k)\right)^{n/2},$
since $n, K \ge 1$. Thus setting $r = 2\,(2+k/(3600\ln^2 k))^{n/2}$ suffices. $\square$
We now bound the mixing time for the Markov chain.

Lemma 22. Let $\Omega = \{0,\ldots,k\}^n$. Then for all $\vec p,\vec q\in\Omega$,
$\pi_a(\vec p)\,\pi_a(\vec q) \le 16\,\pi_a(\vec a\,')\,\pi_a(\vec\eta_e(\vec p,\vec q)).$
Proof. As in the proof of Lemma 12, let $\vec q_{1\ldots i}$ denote $(q_1,\ldots,q_i)$, and similarly for $\vec p_{1\ldots i}$. Then $m_{\vec p} = m_{\vec p_{1\ldots i}}+m_{\vec p_{i+1\ldots n}}$, $m_{\vec q} = m_{\vec q_{1\ldots i}}+m_{\vec q_{i+1\ldots n}}$, $m_{\vec\eta_e(\vec p,\vec q)} = m_{\vec p_{1\ldots i}}+m_{\vec q_{i+1\ldots n}}$, and $m_{\vec a\,'} = m_{\vec q_{1\ldots i}}+m_{\vec p_{i+1\ldots n}}$. Further, all four of these values are in $\{0,\ldots,n\}$. This yields
$\frac{\pi_a(\vec p)\,\pi_a(\vec q)}{\pi_a(\vec a\,')\,\pi_a(\vec\eta_e(\vec p,\vec q))} = \frac{p(\vec x)\,q(\vec x)\,(c+a\,z_{\vec p})(c+a\,z_{\vec q})}{\vec\eta_e(\vec p,\vec q)(\vec x)\,\vec a\,'(\vec x)\,(c+a\,z_{\vec\eta_e(\vec p,\vec q)})(c+a\,z_{\vec a\,'})}$
$= \left(\frac{(c+a\,z_{\vec p})(c+a\,z_{\vec q})}{(c+a\,z_{\vec\eta_e(\vec p,\vec q)})(c+a\,z_{\vec a\,'})}\right)\left(\frac{1+U(\vec p_{1\ldots i},\vec q_{i+1\ldots n},n)+U(\vec q_{1\ldots i},\vec p_{i+1\ldots n},n)+U(\vec p_{1\ldots n},\vec q_{1\ldots n},2n)}{1+U(\vec p_{1\ldots i},\vec p_{i+1\ldots n},n)+U(\vec q_{1\ldots i},\vec q_{i+1\ldots n},n)+U(\vec p_{1\ldots n},\vec q_{1\ldots n},2n)}\right) < 4\left(\frac{(c+a\,z_{\vec p})(c+a\,z_{\vec q})}{(c+a\,z_{\vec\eta_e(\vec p,\vec q)})(c+a\,z_{\vec a\,'})}\right),$
where $U(\vec p_{i\ldots j},\vec q_{i'\ldots j'},\nu) = \exp(-\sigma(m_{\vec p_{i\ldots j}}+m_{\vec q_{i'\ldots j'}}-\nu))$. The last inequality comes directly from the proof of Lemma 12. We now bound the second term. Since $p(\vec x)\le 1$, each $a\,z$ summation is upper bounded by $a|M|$, so the numerator is at most $(c+a|M|)^2$. Meanwhile, the denominator is at least $(c-a|M|)^2$. Thus, since $c = 3a|M|$, the second term is at most 4. $\square$
Lemma 23. For all neighbors $\vec p$ and $\vec q\in\Omega$,
$\max\{\pi_a(\vec p)/\pi_a(\vec q),\ \pi_a(\vec q)/\pi_a(\vec p)\} \le 120Kn\ln k.$
Proof. Since $\vec p$ and $\vec q$ are neighbors, they differ in only one position, so $|m_{\vec p}-m_{\vec q}| \le 1$. Then
$\frac{\pi_a(\vec p)}{\pi_a(\vec q)} = \left(\frac{p(\vec x)}{q(\vec x)}\right)\left(\frac{c+a\,z_{\vec p}}{c+a\,z_{\vec q}}\right) = \left(\frac{1+e^{\sigma n-\sigma m_{\vec q}}}{1+e^{\sigma n-\sigma m_{\vec p}}}\right)\left(\frac{c+a\,z_{\vec p}}{c+a\,z_{\vec q}}\right) \le \left(\frac{1+e^{\sigma n-\sigma(m_{\vec p}-1)}}{1+e^{\sigma n-\sigma m_{\vec p}}}\right)\left(\frac{c+a|M|}{c-a|M|}\right)$
$= \left(\frac{1+e^{\sigma}\,e^{\sigma n-\sigma m_{\vec p}}}{1+e^{\sigma n-\sigma m_{\vec p}}}\right)\left(\frac{c+a|M|}{c-a|M|}\right) \le \left(\frac{1+e^{\sigma n-\sigma m_{\vec p}}}{1+e^{\sigma n-\sigma m_{\vec p}}}\right)2\,e^{\sigma} = 120Kn\ln k,$
since $\sigma = \ln(60Kn\ln k)$ and $(c+a|M|)/(c-a|M|) = 2$ for $c = 3a|M|$. Finally, we note that a symmetric argument can be made for $\pi_a(\vec q)/\pi_a(\vec p)$. $\square$
We now apply Theorem 5 to bound the mixing time of this Markov chain.
Corollary 24. When learning generalized DNF using Perceptron and a logistic $p(\vec x)$, a simulation of $\mathcal M$ that starts at any node and is of length

$$T_i = 3840\,kKn^3\ln k\,\left(n\ln k + \ln(4\alpha_i|M|) + \ln(1/\epsilon')\right)$$

will draw samples from $\hat\pi_{\alpha_i}$ such that $\|\hat\pi_{\alpha_i} - \pi_{\alpha_i}\| \le \epsilon'$.

Proof. Lemmas 22 and 23 bound $g$ and $h$, which we substitute directly into Theorem 5. Also, note that $W(\alpha_i) \le k^n(c + \alpha_i|M|)$ and $w_{\alpha_i,\vec p} \ge 1$, completing the proof. □
4.1.2.3. Linear $p(\vec x)$. We begin by bounding the number of mistakes Perceptron will make. However, as in the linear Winnow case, worst-case (adversary) bounds are not possible, since the average margin could be forced to be negative. Thus we assume that the examples are such that most of them are linearly separable and have a positive average margin $\gamma_{\vec m,M}$.

Lemma 25. When using Eq. (7) to specify the inputs, and if the average margin $\gamma_{\vec m,M}$ of the sequence of examples is positive, then the number of prediction mistakes made by Perceptron when learning DNF is at most

$$\frac{5K\left((k+1)^n + 6n(k+1)^{n-1} + 4n(n-1)(k+1)^{n-2}\right)}{4\gamma^2_{\vec m,M}(n+1)^2}.$$
Proof. We use the same $\vec m$ as we did for the binary and logistic cases, and hence get the same 2-norm for this vector as before. To bound $r$, recall that Eq. (9) sums the $p(\vec x)$ values for the entire set of terms. By substituting $(p(\vec x))^2$ for $p(\vec x)$ in this equation and taking the square root, we get exactly the 2-norm of any input to Perceptron:

$$\|\vec Y\|_2 = \sqrt{\sum_{i=0}^{n} 2^i \binom{n}{i} (k-1)^{n-i} \left(\frac{i+1}{n+1}\right)^2} = \frac{\sqrt{(k+1)^n + 6n(k+1)^{n-1} + 4n(n-1)(k+1)^{n-2}}}{n+1},$$

which completes the proof. □
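The closed form above follows from the binomial identities $\sum_i i\binom{n}{i}2^i(k-1)^{n-i} = 2n(k+1)^{n-1}$ and $\sum_i i(i-1)\binom{n}{i}2^i(k-1)^{n-i} = 4n(n-1)(k+1)^{n-2}$. A quick numerical check of the identity for small $n$ and $k$:

```python
from math import comb, sqrt, isclose

def norm_direct(n, k):
    # Direct evaluation of the sum under the square root.
    return sqrt(sum(2**i * comb(n, i) * (k - 1)**(n - i) * ((i + 1) / (n + 1))**2
                    for i in range(n + 1)))

def norm_closed(n, k):
    # Closed form from the proof of Lemma 25.
    return sqrt((k + 1)**n + 6*n*(k + 1)**(n - 1)
                + 4*n*(n - 1)*(k + 1)**(n - 2)) / (n + 1)

for n in range(2, 8):
    for k in range(2, 6):
        assert isclose(norm_direct(n, k), norm_closed(n, k))
```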
Now we bound the mixing time.
Lemma 26. Let $\Omega = \{0,\ldots,k\}^n$. Then for all $\vec p, \vec q \in \Omega$,

$$\pi_\alpha(\vec p)\,\pi_\alpha(\vec q) \le \frac{4(1 + n/2)^2}{n+1}\,\pi_\alpha(\vec a\,')\,\pi_\alpha(\vec\eta_e(\vec p,\vec q)).$$

Proof. Using the notation introduced in the first paragraph of Lemma 22's proof, we get

$$\frac{\pi_\alpha(\vec p)\,\pi_\alpha(\vec q)}{\pi_\alpha(\vec a\,')\,\pi_\alpha(\vec\eta_e(\vec p,\vec q))} = \left(\frac{\vec p(\vec x)\,\vec q(\vec x)}{\vec\eta_e(\vec p,\vec q)(\vec x)\,\vec a\,'(\vec x)}\right)\left(\frac{(c+\alpha z_{\vec p})(c+\alpha z_{\vec q})}{(c+\alpha z_{\vec\eta_e(\vec p,\vec q)})(c+\alpha z_{\vec a\,'})}\right) \le \left(\frac{(1+n/2)^2}{n+1}\right)\cdot 4,$$

using results from the proofs of Lemmas 15 and 22. □
Lemma 27. For all neighbors $\vec p$ and $\vec q \in \Omega$,

$$\max\{\pi_\alpha(\vec p)/\pi_\alpha(\vec q),\; \pi_\alpha(\vec q)/\pi_\alpha(\vec p)\} \le 4.$$

Proof. Since $\vec p$ and $\vec q$ are neighbors, they differ in only one position, so $|m_{\vec p} - m_{\vec q}| \le 1$. Then

$$\frac{\pi_\alpha(\vec p)}{\pi_\alpha(\vec q)} = \frac{\vec p(\vec x)}{\vec q(\vec x)}\cdot\frac{c+\alpha z_{\vec p}}{c+\alpha z_{\vec q}} = \frac{1 + m_{\vec p}}{1 + m_{\vec q}}\cdot\frac{c+\alpha z_{\vec p}}{c+\alpha z_{\vec q}} \le \frac{2 + m_{\vec q}}{1 + m_{\vec q}}\cdot 2 \le 4.$$

Finally, we note that a symmetric argument can be made for $\pi_\alpha(\vec q)/\pi_\alpha(\vec p)$. □
Corollary 28. When learning generalized DNF using Perceptron and a linear $p(\vec x)$, a simulation of $\mathcal M$ that starts at any node and is of length

$$T_i = \left(\frac{32kn^2(1+n/2)^2}{n+1}\right)\left(n\ln k + \ln(4\alpha_i|M|) + \ln(1/\epsilon')\right)$$

will draw samples from $\hat\pi_{\alpha_i}$ such that $\|\hat\pi_{\alpha_i} - \pi_{\alpha_i}\| \le \epsilon'$.

Proof. Lemmas 26 and 27 bound $g$ and $h$, which we substitute directly into Theorem 5. Also, note that $W(\alpha_i) \le k^n(c + \alpha_i|M|)$ and $w_{\alpha_i,\vec p} \ge 1$, completing the proof. □
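The mixing-time corollaries above all feed into the same estimator: Eq. (1) writes $W(\alpha)$ as $W(1)$ times a telescoping product of ratios $W(\alpha_i)/W(\alpha_{i-1})$, each of which is the expectation of $(\alpha_i/\alpha_{i-1})^{z_{\vec p}}$ under the stationary distribution $\pi_{\alpha_{i-1}}$. The sketch below illustrates this on a toy weight function; the state space, update counts $z_{\vec p}$, and geometric cooling schedule are all assumptions for illustration, and exact sampling stands in for the MCMC simulation:

```python
import random

random.seed(0)
n = 8
states = [tuple((j >> b) & 1 for b in range(n)) for j in range(2**n)]
z = {p: sum(p) for p in states}          # assumed update counts z_p

def W(a):
    # Total weight at parameter a, with w_p = a**z_p.
    return sum(a**z[p] for p in states)

# Cooling schedule 1 = a_0 < a_1 < ... < a_r = alpha, as in Eq. (1).
alpha, r = 1.5, 16
schedule = [alpha**(i / r) for i in range(r + 1)]

est = W(1.0)
for a_prev, a_next in zip(schedule, schedule[1:]):
    # W(a_next)/W(a_prev) = E_{pi_{a_prev}}[(a_next/a_prev)**z_p];
    # here we sample pi_{a_prev} exactly (MCMC would only approximate it).
    weights = [a_prev**z[p] for p in states]
    S = 4000
    samples = random.choices(states, weights=weights, k=S)
    est *= sum((a_next / a_prev)**z[p] for p in samples) / S

# The product of ratio estimates should land close to the true W(alpha).
print(est, W(alpha))
```

Because each per-step ratio $\alpha_i/\alpha_{i-1}$ is close to 1, the random variables being averaged have low variance, which is why a modest sample size per ratio suffices.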
4.2. Pruning ensembles of classifiers
We now apply our methods to pruning an ensemble produced by, e.g., AdaBoost [35]. AdaBoost's output is a set of functions $h_i : X \to \mathbb R$, where $i \in \{1,\ldots,n\}$ and $X$ is the instance space. Each $h_i$ is trained on a different distribution over the training examples and is associated with a parameter $\beta_i \in \mathbb R$ that weights its predictions. Given an instance $\vec x \in X$, the ensemble's prediction is $H(\vec x) = \mathrm{sign}\left(\sum_{i=1}^n \beta_i h_i(\vec x)\right)$. Thus $\mathrm{sign}(h_i(\vec x))$ is $h_i$'s prediction on $\vec x$, $|h_i(\vec x)|$ is its confidence in its prediction, and $\beta_i$ weights AdaBoost's confidence in $h_i$. It has been shown that if each $h_i$ has error less than $1/2$ on its distribution, then the error on the training set and the generalization error of $H(\cdot)$ can be bounded. Strong bounds on $H(\cdot)$'s generalization error can be shown even if the boosting algorithm is run past the point where $H(\cdot)$'s training error is zero [34]. However, overfitting can still occur [26]; i.e., sometimes better generalization can be achieved if some of the $h_i$'s are discarded. So our goal is to find a weighted combination of all possible prunings that performs not much worse, in terms of generalization error, than the best single pruning.

To predict nearly as well as the best pruning, we place every possible pruning in a pool (so $N = 2^n$) and run WM. We start by computing $W^+$ and $W^-$, which are, respectively, the sums of the weights of the experts predicting a positive and a negative label on example $\vec x$. Then WM predicts $+1$ if $W^+ > W^-$ and $-1$ otherwise. Whenever WM makes a prediction mistake, it reduces the weights of all experts that predicted incorrectly by dividing them by $\alpha$ (see Section 2.1).

As in Section 4.1, using a binary $p(\vec x)$ makes bounding the mixing time difficult, except in a trivial sense. Thus we use a linear $p(\vec x)$, which allows us to also incorporate each pruning's confidence in its prediction, and to use that confidence when updating the weights. Given an example $\vec x \in X$, we compute $h_i(\vec x)$ for all $i \in \{1,\ldots,n\}$. We then use our MCMC procedure to compute $\hat W^+$, an estimate of $W^+ = \sum_{\vec p \in \Omega^+} \vec p(\vec x)\,w_{\vec p}$, where $\vec p(\vec x) = \sum_{i=1}^n p_i\beta_i h_i(\vec x)$, $\Omega^+ = \{\vec p \in \{0,1\}^n : \sum_{i=1}^n p_i\beta_i h_i(\vec x) \ge 0\}$, $w_{\vec p} = \alpha^{z_{\vec p}}$, $z_{\vec p} = \sum_{\vec x \in M} c_{\vec x}\,\vec p(\vec x)$, and $M$ is the set of examples for which a prediction mistake was made. A similar procedure is used to compute $\hat W^-$. Then WM predicts $+1$ if $\hat W^+ > \hat W^-$ and $-1$ otherwise.
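For intuition, the brute-force version of this procedure (what our MCMC method approximates) can be sketched directly. All names here are illustrative: the base classifiers, their $\beta_i$ weights, and the simple mistake-driven division by $\alpha$ are assumptions standing in for the paper's confidence-weighted update, and the linear confidence $\vec p(\vec x) = \sum_i p_i\beta_i h_i(\vec x)$ follows the text above:

```python
from itertools import product

def wm_predict_and_update(prunings_w, betas, hs, x, label, alpha=2.0):
    """One brute-force WM trial over all prunings p in {0,1}^n.

    prunings_w: dict mapping each pruning (a 0/1 tuple) to its current weight.
    hs: list of base-classifier functions h_i; betas: their ensemble weights.
    """
    def conf(p):
        # Linear confidence p(x) = sum_i p_i * beta_i * h_i(x).
        return sum(pi * b * h(x) for pi, b, h in zip(p, betas, hs))

    # W+ / W- sum the (magnitude-weighted) weights of positively /
    # negatively predicting prunings.
    w_plus = sum(conf(p) * prunings_w[p] for p in prunings_w if conf(p) >= 0)
    w_minus = sum(-conf(p) * prunings_w[p] for p in prunings_w if conf(p) < 0)
    pred = 1 if w_plus > w_minus else -1
    if pred != label:
        # Demote every pruning whose own prediction was incorrect.
        for p in prunings_w:
            if (1 if conf(p) >= 0 else -1) != label:
                prunings_w[p] /= alpha
    return pred

# Tiny example: n = 3 stumps on scalar inputs (assumed toy classifiers).
hs = [lambda x: 1.0 if x > 0 else -1.0,
      lambda x: 1.0,
      lambda x: -1.0]
betas = [1.0, 0.5, 0.25]
weights = {p: 1.0 for p in product((0, 1), repeat=3)}
pred = wm_predict_and_update(weights, betas, hs, x=0.5, label=-1)
```

Enumerating all $2^n$ prunings is exactly the cost the MCMC estimates of $\hat W^+$ and $\hat W^-$ are designed to avoid.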
Define the Markov chain $\mathcal M$ with state space $\Omega^+$ (similarly, $\Omega^-$) that makes transitions according to the description in Section 3. The chain corresponds to a random walk on the Boolean hypercube truncated by a hyperplane. It is easy to show that all pairs of states in $\Omega^+$ (similarly, $\Omega^-$) can communicate. To move from node $\vec p \in \Omega^+$ to $\vec q \in \Omega^+$, first add to $\vec p$ all bits $i$ in $\vec q$ and not in $\vec p$ that correspond to positions where $\beta_i h_i(\vec x) \ge 0$. Then delete from $\vec p$ all bits $i$ in $\vec p$ and not in $\vec q$ that correspond to positions where $\beta_i h_i(\vec x) < 0$. Then delete the unnecessary "positive bits" and add the necessary "negative bits." It is easy to see that all states between $\vec p$ and $\vec q$ are in $\Omega^+$. Thus $\mathcal M$ is irreducible and hence ergodic by Lemma 1.

As before, we let $B$ be an upper bound$^{18}$ on the sum of all updates made on any pruning. Then it is straightforward to adapt Lemma 8's proof to bound $f \in [1/e, e]$ for WM when applying Section 3's procedure to estimate $W(\alpha)$. But when applying Eq. (1), we must determine $W(\alpha_1) = W(1) = |\Omega^+|$. This is equivalent to counting the number of solutions to a 0-1 knapsack problem, which is #P-complete. Thus, in order to complete our computation of $\hat W$, we must also estimate $|\Omega^+|$. We do this by mapping the problem to the knapsack problem, which is summarized in Section 2.4 and shown to have an FPRAS by Morris and Sinclair [28]. If we let the "weight" of item $i$ in our problem be $w_i = \beta_i h_i(\vec x)$ for an example $\vec x$, then the only difference between the two problems is that the weights in the $|\Omega^+|$ estimation problem may be negative. We now argue that the problems are still equivalent, and thus that we can directly apply the results of Morris and Sinclair. Given a vector $\vec p \in \{0,1\}^n$ and a weight vector $\vec w$, let $p'_i = 1 - p_i$ if $w_i < 0$ and $p'_i = p_i$ otherwise. Also, let $w'_i = |w_i|$ and $b' = \sum_{w_i < 0} |w_i|$. It is easy to argue that $\sum_{i=1}^n w_i p_i \ge 0$ (which is the definition$^{19}$ of $\Omega^+$) if and only if $\sum_{i=1}^n w'_i p'_i \ge b'$ (which is an instance of the knapsack problem). If we let $s^+ = \sum_{w_i>0,\,p_i=1} w_i$, $s^- = \sum_{w_i<0,\,p_i=1} w_i$, and $s^0 = \sum_{w_i<0,\,p_i=0} w_i$, then $b' = -s^0 - s^-$, which is exactly what is added to both sides of the inequality $\sum_{i=1}^n w_i p_i \ge 0$ to get $\sum_{i=1}^n w'_i p'_i \ge b'$. Thus we can efficiently estimate $|\Omega^+|$ to within a factor of $\epsilon$, allowing us to apply Corollary 3.
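The signed-weight reduction above can be checked exhaustively for small $n$: flipping the bits at negative-weight positions, taking absolute weights, and thresholding at $b'$ preserves membership exactly. The particular weight vector below is an arbitrary assumption for illustration:

```python
from itertools import product

def in_omega_plus(p, w):
    # Original membership test, with possibly negative weights.
    return sum(wi * pi for wi, pi in zip(w, p)) >= 0

def in_knapsack(p, w):
    # Transformed instance: flip bits at negative weights, take absolute
    # weights, and threshold at b' = sum of |w_i| over the negative w_i.
    p_prime = [1 - pi if wi < 0 else pi for wi, pi in zip(w, p)]
    w_prime = [abs(wi) for wi in w]
    b_prime = sum(abs(wi) for wi in w if wi < 0)
    return sum(wi * pi for wi, pi in zip(w_prime, p_prime)) >= b_prime

w = [2.5, -1.0, 0.75, -3.0, 1.5]
for p in product((0, 1), repeat=len(w)):
    assert in_omega_plus(p, w) == in_knapsack(p, w)
```

In particular, the two membership tests agree on every vertex, so the count of knapsack solutions equals $|\Omega^+|$.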
Corollary 29. When applying WM to learn a weighted combination of ensemble prunings, let the sample size $S = \lceil 130\,re^2/\epsilon^2 \rceil$, let $|\Omega^+|$ be estimated to within $\epsilon/2$ of its true value with probability $\ge 3/4$ via the procedure outlined above, and let $\mathcal M$ be simulated long enough for each sample that the variation distance between the empirical distribution and $\pi_{\alpha_i}$ is at most $\epsilon/(10e^2 r)$. Then for any $\delta > 0$, $\hat W^+(\alpha)$ satisfies

$$\Pr\left[(1-\epsilon)W^+(\alpha) \le \hat W^+(\alpha) \le (1+\epsilon)W^+(\alpha)\right] \ge 1 - \delta,$$

and the same result applies to $\hat W^-(\alpha)$ if $|\Omega^-|$ is well approximated.
Note that if $W^+/W^- \notin \left[\frac{1-\epsilon}{1+\epsilon}, \frac{1+\epsilon}{1-\epsilon}\right]$ for all trials, then our estimates of $W^+$ and $W^-$ are (with probability at least $1 - \delta'_t$ for trial $t$) sufficiently accurate to correctly determine whether or not $W^+ > W^-$. Setting $\delta'_t = \delta/2^t$ yields a total probability of failure of at most$^{20}$ $\sum_{t=1}^{N} \delta/2^t < \delta$. Thus
18. In contrast to Section 4.1, where $B$ is by definition upper bounded by Winnow's or Perceptron's mistake bound, for this application $B$ could be arbitrarily large, since it depends on the predictions of arbitrary hypotheses. Thus for the rest of this section we implicitly assume that $B$ is polynomial in all relevant parameters, i.e., that it is expressed in unary.
19. By negating all $w_i$, we can use the same arguments to estimate $|\Omega^-|$.
20. Recall from the proof of Theorem 2 that only $O(\log(1/\delta'))$ runs of the estimation procedure are needed to reduce the probability of failure to $\delta'$.
under these conditions, our version of WM runs identically to the brute-force version, and we can apply WM's mistake bounds. This yields the following corollary.
Corollary 30. Using the assumptions of Corollary 29, if $W^+/W^- \notin \left[\frac{1-\epsilon}{1+\epsilon}, \frac{1+\epsilon}{1-\epsilon}\right]$ for all trials $t$, then with probability at least $1 - \delta$, the number of prediction mistakes made by this algorithm on any sequence of examples is $O(n + m)$, where $n$ is the number of hypotheses in the ensemble and $m$ is the number of mistakes made by the best pruning.
Now we bound the mixing time. Bounding $g$ is easy: when viewed as multisets, $\vec p \cup \vec q = \vec a\,' \cup \vec\eta_e(\vec p,\vec q)$, which implies $z_{\vec p} + z_{\vec q} = z_{\vec a\,'} + z_{\vec\eta_e(\vec p,\vec q)}$ and $\pi^+_\alpha(\vec p)\,\pi^+_\alpha(\vec q) = \pi^+_\alpha(\vec a\,')\,\pi^+_\alpha(\vec\eta_e(\vec p,\vec q))$. Thus $g = 1$. Further, if $\vec p$ and $\vec q$ are neighbors that differ in bit $i$, then

$$z_{\vec p} - z_{\vec q} = \sum_{\vec x \in M} \left(\sum_{j : p_j = 1} c_{\vec x}\beta_j h_j(\vec x) - \sum_{j : q_j = 1} c_{\vec x}\beta_j h_j(\vec x)\right) \le \beta_i \sum_{\vec x \in M} c_{\vec x}\,h_i(\vec x),$$

which implies that for any neighbors $\vec p$ and $\vec q$, $\max\{\pi_\alpha(\vec p)/\pi_\alpha(\vec q),\; \pi_\alpha(\vec q)/\pi_\alpha(\vec p)\} \le \alpha^{B_{\max}}$, where $B_{\max} = \max_j\left\{\beta_j \sum_{\vec x \in M} c_{\vec x}\,h_j(\vec x)\right\}$.
Corollary 31. When learning a weighted combination of ensemble prunings with Weighted Majority, if $\Omega = \{0,1\}^n$, then a simulation of $\mathcal M$ that starts at any node and is of length

$$T_i = 2n^2\alpha_i^{B_{\max}}\left(n\ln 2 + n(B_{\max} - B_{\min})\ln\alpha_i + \ln(1/\epsilon')\right)$$

will draw samples from $\hat\pi_{\alpha_i}$ such that $\|\hat\pi_{\alpha_i} - \pi_{\alpha_i}\| \le \epsilon'$.

Proof. We substitute our bounds on $g$ and $h$ directly into Theorem 5 and note that $W(\alpha_i) \le 2^n\alpha_i^{nB_{\max}}$ and $w_{\alpha_i,\vec p} \ge \alpha_i^{nB_{\min}}$ (where $B_{\min}$ is the analogue of $B_{\max}$ with min in place of max), completing the proof. □
The above mixing time bound is only polynomial if $B_{\max}$ is logarithmic in all relevant parameters, which is unlikely. However, our analysis is handicapped by worst-case assumptions, as with Corollary 14. While it is unlikely that an efficient bound on the mixing time exists for general ensembles with arbitrary classifiers and arbitrary distributions over examples, it is open whether restricted cases could have better bounds. Further, in Section 5 we show that in practice our algorithm performs much better than the worst-case theoretical results imply, especially considering that highly accurate estimates of the weighted sums are not needed so long as we know whether or not $W^+ > W^-$.

Note that in Corollary 31 we assume that $\Omega = \{0,1\}^n$, i.e., that all prunings classify $\vec x$ as positive (or negative), and the chain is an untruncated hypercube. This is because without such an assumption we cannot guarantee that a canonical path between two nodes never leaves $\Omega$. A new approach employing balanced, almost-uniform permutations has recently been pioneered by Morris and Sinclair [28] and applied to the truncated Boolean hypercube problem when the chain's stationary distribution is uniform (i.e., for counting the number of solutions to the knapsack problem). It is plausible that their technique could be generalized to the case of a non-uniform distribution, which would allow us to consider truncated hypercubes for WM and for other algorithms.
4.3. Discussion
In examining the results and proofs related to using Winnow on DNF, we see some interesting differences. Recall from Theorem 5 that there are two functions that must be bounded in order to bound a chain's mixing time: $g$, which bounds the ratio of $\pi(\vec p)\pi(\vec q)$ to $\pi(\vec a\,')\pi(\vec\eta_e(\vec p,\vec q))$, and $h$, which bounds the ratio of weights of neighboring nodes in the chain. Linear $p(\vec x)$ allows us to bound $g$ with a polynomial, since the $z$'s in the exponent of $\alpha$ all cancel out due to the linear nature of the weight updates. However, logistic $p(\vec x)$ (and probably binary $p(\vec x)$) in the worst case gets charged a multiplicative factor of $\alpha$ for each update, yielding an upper bound on $g$ that is exponential in $|M|$. In contrast, when bounding $h$ in an adversarial setting, both linear and logistic $p(\vec x)$ allow the difference in the $z$'s to grow with $|M|$. Further, in the worst case we cannot bound $|M|$ for linear $p(\vec x)$, but we can for logistic and binary.

A more fundamental difference arises when comparing the multiplicative weight-update algorithms (Winnow and WM) to the additive weight-update algorithm (Perceptron). The additive weight updates prevent $\mathcal M$'s stationary distribution from deviating very far from uniform (compared to the MWU algorithms). Thus for Perceptron, $\mathcal M$ mixes rapidly, even if $|M|$ is exponentially large. However, from a learning-theoretic standpoint, Perceptron is not a good choice for learning DNF, since Lemma 21's bound on the number of mistakes (updates) Perceptron will make in the worst case is exponential, which is corroborated by Khardon et al.'s lower bound [18].
predictions. This comes for free if the boosting algorithm used is AdaBoost.MH from Schapireand Singer [35], which reduces the multiclass boosting problem to a set of binary ones, allowingour results to fit into this framework. Alternatively, AdaBoost.M1 from Freund and Schapire [11]more directly addresses the multiclass problem by having each hypothesis predict its confidencethat ~xx belongs to class j: Then the ensemble’s prediction is the class that maximizes theseconfidence-rated predictions. That is, each class is tested individually and the one that scores thehighest is the predicted class for ~xx: To adapt our framework to this, rather than simply estimating
Wþ and W; we estimate W j for each class j and then predict the class with the maximum. Since
each W j estimate uses a separate Boolean hypercube truncated by a single hyperplane (the onethat separates prunings that predict class j from those predicting another class), we can bound themixing time using the same machinery developed in Section 4.2, assuming a version of Theorem 5exists for truncated cubes.Since one of the goals of pruning an ensemble of classifiers is to reduce its size, one may adopt
one of several heuristics, such as choosing the pruning that has highest weight in WM, the highestratio of weight to size, or the highest product of weight and diversity, where diversity is measuredby e.g. Kullback–Leibler divergence (see [26]). Let f ð~ppÞ be the function that one wants to
maximize. Then the goal is to find the ~ppAf0; 1gn that approximately maximizes f : To do this one
ARTICLE IN PRESS
D. Chawla et al. / Journal of Computer and System Sciences 69 (2004) 196–234 225
can define a newMarkov chainM0 whose transition probabilities are the same as forM in Section3 except that Step 3 is irrelevant (since there is no training example ~xx) and in Step 4, change the
transition probability to minf1; rf ð~pp 00Þf ð~ppÞg: The parameter r governs the shape of the stationarydistribution: r ¼ 1 implies a uniform distribution over all prunings, while a large value of r yieldsa distribution that peaks at prunings with large f ð~ppÞ: (This is a special case of simulated annealing
[19] where the temperature is held constant.) Lemma 1 obviously holds for M0; but it is an openproblem to bound how far from optimal its solution will be. Of course, other combinatorialoptimization methods such as genetic algorithms can also be applied here.Similarly, one issue with our DNF algorithm is that after training, we still require the training
examples and running M to evaluate the hypothesis on a new example. In lieu of this, one can,after training, search (using a modified chain or a GA as described above) for the terms with thelargest weights in Winnow. The result is a set of rules, and the prediction on a new example can bea thresholded sum of weights of satisfied rules, using the same threshold y: The only issue then isto determine how many terms to select. Since each example satisfies exactly 2n terms, for anexample to be classified as positive, the average weight of its satisfied terms must be at least y=2n:Thus one heuristic is to choose as many terms as possible with weight at least y=2n; stopping whenwe find ‘‘too many’’ (as specified by a parameter) terms with weight less than y=2n: Using thispruned set of rules, no additional false positives will occur, and in fact the number might bereduced. The only concern is causing extra false negatives.
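The constant-temperature chain $\mathcal M'$ described above can be sketched as a simple Metropolis-style search. The objective $f$ below is an assumed toy function (not one of the paper's heuristics), and the parameter values are illustrative:

```python
import random

def search_pruning(f, n, rho=4.0, steps=20000, seed=1):
    """Search over prunings p in {0,1}^n: propose a single bit flip and
    accept with probability min(1, rho**(f(p_new) - f(p))), tracking the
    best pruning seen (rho = 1 would give a uniform random walk)."""
    rng = random.Random(seed)
    p = [0] * n
    best, best_val = list(p), f(p)
    for _ in range(steps):
        i = rng.randrange(n)
        p_new = list(p)
        p_new[i] ^= 1
        if rng.random() < min(1.0, rho ** (f(p_new) - f(p))):
            p = p_new
            if f(p) > best_val:
                best, best_val = list(p), f(p)
    return best, best_val

# Assumed toy objective: reward prunings that keep the first three
# classifiers and drop the rest (unique maximum 8 at p = 11100000).
def f(p):
    return sum(p[:3]) + sum(1 - b for b in p[3:])

best, val = search_pruning(f, n=8)
```

On this separable toy objective the chain finds the optimum easily; for the paper's heuristics the quality of the solution is, as noted above, an open problem.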
5. Empirical results
The primary purpose of our experiments is to assess how well our algorithms work in practice, especially when compared to our worst-case bounds. A more thorough empirical analysis (as well as some heuristic optimizations) of our technique is given by Tao and Scott [41].

The goal of our algorithms is to use the weighted-sum approximations to accurately simulate Winnow and WM, since an accurate simulation is required for us to apply Winnow's and WM's error bounds.$^{21}$ We measure accuracy of simulation in several ways: (1) comparing weighted-sum estimates to their true values (computed by brute-force implementations); (2) counting the number of times our algorithm predicts differently from brute force (e.g., for Winnow, the fraction of weighted-sum estimates that are on the opposite side of the threshold from the true weighted sum); and (3) measuring prediction error. The simulated data we used in our experiments should make learning straightforward for brute force, so low prediction error should correlate (to some extent) with simulation accuracy. This performance measure is especially useful on problems that are too large for brute force to handle.
5.1. Learning DNF formulas
In our DNF experiments, we defined the instance space to be $X = \{1,2\}^n$ and the set of terms to be $P = \{0,1,2\}^n$; i.e., $k_i = 2$ for all $i$. Recall that a term $\vec p = (p_1,\ldots,p_n) \in P$ is satisfied by example $\vec x = (x_1,\ldots,x_n) \in X$ if and only if $p_i = x_i$ for all $p_i > 0$. So $p_i = 0$ implies that $x_i$ is irrelevant for term $\vec p$, and $p_i > 0$ implies that $x_i$ must equal $p_i$ for $\vec p$ to be satisfied. Even though we do not have an analysis of its mixing time, we used binary $p(\vec x)$ in our experiments.$^{22}$ Even though with binary $p(\vec x)$ we could define for each new example $\Omega = \{0,1\}^n$ (i.e., an untruncated Boolean hypercube; see Section 4.1.1), we chose to use as $\Omega$ a version of $P$ truncated with a single hyperplane. We did this for two reasons. First, doing so allows us to evaluate the performance of our MCMC approach when the hypercube is truncated, which is more generally applicable and also currently lacking in theoretical results. Second, this experimental approach gives us a better idea of how quickly the time complexity of a brute-force implementation grows as a function of $n$. Comparing this with the time of our MCMC experiments tells us the minimum value of $n$ for which our approach is faster.

We generated random (from a uniform distribution) $K = 5$-term DNF formulas, using $n \in \{10, 15, 20\}$. So the total number of Winnow inputs was $3^{10} = 59049$, $3^{15} \approx 1.43 \times 10^7$, or $3^{20} \approx 3.49 \times 10^9$. For each value of $n$ there were nine training/testing set combinations, each with 50 training examples and 50 testing examples. Examples were generated uniformly at random.

21. Conceivably, in some cases our algorithm may accidentally make fewer prediction mistakes when deviating from the brute-force implementations, but in the absence of a formal theory to characterize this, it is safer to assume that such behavior is anomalous. Thus our primary goal is to measure how well we simulate brute force.

Table 1 gives averaged$^{23}$ results for $n = 10$, indexed by $S$ and $T$ ("BF" means brute force).
"GUESS" is the average relative error of the estimates ($= |\text{guess} - \text{actual}|/\text{actual}$). "LOW" is the fraction of guesses that were $< \theta$ when the actual value was $> \theta$, and "HIGH" is symmetric. These are the only times our algorithm deviates from brute force. "PM" is the number of prediction mistakes made by Winnow on the training set while learning (in all experiments, Winnow repeatedly made passes over the training set until it correctly classified all training examples). This gives an evaluation of each algorithm in an on-line setting. After training was complete, we also evaluated the hypotheses on their respective test sets, i.e., in a batch learning setting. "GE" is the generalization error on the test set. Finally, "Stheo" and "Ttheo" are the values of $S$ and $T$ from Corollaries 9 and 14 that guarantee an error of GUESS given the values of $r$ in our simulations, using $\alpha = 3/2$. These last two columns show how pessimistic the worst-case bounds are in contrast to what works in practice. In general, the results varied little across the different data sets: the standard deviations of the values were typically small compared to the means.

Both GUESS and HIGH are very sensitive to $T$ but not as sensitive to $S$. LOW was negligible due to the distribution of weights as training progressed: the term $\vec p = \vec 0$ (satisfied by all examples) had high weight. Since all computations started at $\vec 0$ and the Markov chain $\mathcal M$ seeks out nodes with high weights, the estimates tended to be too high rather than too low. But this is less significant as $S$ and $T$ increase. For $S = 100$ and $T = 300$, training and testing with $\mathcal M$ was slower than brute force by a factor of over 108. The average value of $r$ used was 20.79 (range 19-26).

Since the run time of our algorithm varies linearly with $r$, we ran some experiments where we fixed $r$ rather than letting it be set as in Section 4.1. We set $S = 100$, $T = 300$, and $r \in \{5, 10, 15, 20\}$. The results are in Table 2. They indicate that for the given parameter values, $r$ can be reduced below that which is stipulated in Section 3.
22. When we compare actual mixing times to theoretical bounds, we compare to the theoretical bounds for logistic $p(\vec x)$, which is a reasonable approximation to the binary case.
23. The number of weight estimations made per row in the table varied due to a varying number of training rounds, but was typically around 3000.
Results for $n = 15$ appear in Table 3. The trends for $n = 15$ are similar to those for $n = 10$. Brute force is faster than $\mathcal M$ at $S = 500$ and $T = 1500$, but only by a factor of 16. The average value of $r$ used was 31.52 (range 19-40). As with $n = 10$, $r$ can be reduced to speed up the algorithm, but at a cost of increasing the errors of the predictions (e.g., see Table 4). We ran the same experiments with a training set of size 100 rather than 50 (the test set was still of size 50), summarized in Table 5. As expected, the error of the guesses changes little, but GE is decreased.

For $n = 20$, no exact (brute-force) sums were computed, since there are over 3 billion inputs, so we only examined the prediction error of our algorithm. With $S = 1000$, $T = 2000$, $r$ set as in
Table 1
Results for n = 10 and r chosen as in Section 3

S     T     GUESS    LOW      HIGH     PM      GE       Stheo          Ttheo
100   100   0.4713   0.0000   0.1674   35.67   0.0600   2.23 x 10^5    1.772 x 10^104
100   200   0.1252   0.0017   0.0350   35.67   0.0533   3.16 x 10^6    1.777 x 10^104
100   300   0.0634   0.0041   0.0172   37.89   0.0711   1.23 x 10^7    1.780 x 10^104
100   500   0.0484   0.0091   0.0078   40.11   0.0844   2.11 x 10^7    1.781 x 10^104
500   100   0.4826   0.0000   0.1594   34.67   0.1000   2.13 x 10^5    1.772 x 10^104
500   200   0.1174   0.0000   0.0314   33.83   0.0600   3.60 x 10^6    1.778 x 10^104
500   300   0.0441   0.0043   0.0145   34.22   0.0867   2.55 x 10^7    1.781 x 10^104
500   500   0.0232   0.0034   0.0064   37.88   0.0800   9.16 x 10^7    1.784 x 10^104
BF                                     36.56   0.0730
Table 2
Results for n = 10, S = 100, and T = 300

r    GUESS    LOW      HIGH     PM      GE
5    0.1279   0.0119   0.0203   40.67   0.0844
10   0.0837   0.0095   0.0189   38.33   0.0867
15   0.0711   0.0058   0.0159   37.78   0.0800
20   0.0638   0.0042   0.0127   36.22   0.0889
BF                              36.56   0.0730
Table 3
Results for n = 15 and r chosen as in Section 3

S      T      GUESS    LOW      HIGH     PM      GE       Stheo          Ttheo
500    1500   0.0368   0.0028   0.0099   60.22   0.0700   5.01 x 10^7    4.112 x 10^151
500    1800   0.0333   0.0040   0.0049   60.75   0.0675   6.12 x 10^7    4.112 x 10^151
500    2000   0.0296   0.0035   0.0023   57.00   0.0675   7.68 x 10^7    4.113 x 10^151
1000   1500   0.0388   0.0015   0.0042   56.25   0.0650   4.51 x 10^7    4.111 x 10^151
1000   1800   0.0253   0.0006   0.0038   59.00   0.0775   1.06 x 10^8    4.114 x 10^151
1000   2000   0.0207   0.0025   0.0020   49.00   0.0800   1.58 x 10^8    4.115 x 10^151
BF                                       60.22   0.0800
Section 3, and a training set of size 100, the average number of prediction mistakes was 91.75 and the average GE was 0.11. The average value of $r$ used was 55 (range 26-78), and the run time for $\mathcal M$ was over 270 times faster than brute force (brute force was run on a small number of examples to estimate its time complexity for $n = 20$). Thus for this case our algorithm provides a significant speed advantage. When running our algorithm with a fixed value of $r = 30$ (reducing the time per example by almost a factor of 2), GE increases to 0.1833.

In summary, even though our experiments are for small values of $n$, they indicate that relatively small values of $S$, $T$, and $r$ are sufficient to minimize our algorithm's deviations from brute-force Winnow. In addition, our algorithm becomes significantly faster than brute force somewhere between $n = 15$ and $n = 20$, which is small for a machine learning problem. However, our implementation is still extremely slow, taking several days or longer to finish training when $n = 20$ (evaluating the learned hypothesis is also slow). Thus we are actively working on optimizations to speed up learning and evaluation (see Section 6).
5.2. Ensemble pruning
For the Weighted Majority experiments, we used AdaBoost over decision shrubs (depth-2 decision trees) generated by C4.5 [31] to learn hypotheses for an artificial two-dimensional data set (Fig. 1). The target concept is a circle of radius 10, and the examples are distributed around its circumference, with each point's distance from the circle normally distributed with zero mean and unit variance. By concentrating examples around the circular boundary and limiting each decision tree's depth, we required ensembles of multiple trees to achieve low classification error on the data. We created an ensemble of 10 classifiers and simulated WM with$^{24}$ $S \in \{50, 75, 100\}$ and
Table 5
Results for n = 15, S = 500, T = 1500, and a training set of size 100

r    GUESS    LOW      HIGH     PM      GE
10   0.0577   0.0046   0.0478   78.00   0.0511
20   0.0456   0.0032   0.0073   78.33   0.0733
30   0.0405   0.0044   0.0081   74.44   0.0689
BF                              80.11   0.0356
Table 4
Results for n = 15, S = 500, T = 1500, and a training set of size 50

r    GUESS    LOW      HIGH     PM      GE
10   0.0572   0.0049   0.0132   59.22   0.1075
20   0.0444   0.0033   0.0063   61.22   0.0756
30   0.0407   0.0022   0.0047   62.00   0.0822
BF                              60.22   0.0800
24. The estimation of $|\Omega^+|$ required an order of magnitude larger values of $S$ and $T$ than did the estimation of the ratios, in order to get sufficiently low error rates.
$T \in \{500, 750, 1000\}$ on the set of $2^{10}$ prunings, and compared the values computed for Eq. (1) to the true values from brute-force WM. The results are in Table 6: "$|\Omega^+|$" denotes the error of our estimates of $|\Omega^+|$; "$X_i$" denotes the error of our estimates of the ratios $W^+(\alpha_i)/W^+(\alpha_{i-1})$; and "$W^+(\alpha)$" denotes the error of our estimates of $W^+(\alpha)$. Finally, "DEPARTURE" indicates our algorithm's departure from brute-force WM; i.e., in these experiments our algorithm perfectly
Fig. 1. The circle data set. (Scatter plot of the "positives" and "negatives" distributed around a circle of radius 10 centered at the origin; both axes range from -15 to 15.)
Table 6
Results for n = 10 and r chosen as in Section 3

S     T      |Omega+|   Xi        W+(alpha)   DEPARTURE
50    500    0.0423     0.00050   0.0071      0.0000
50    750    0.0332     0.00069   0.0061      0.0000
50    1000   0.0419     0.00068   0.0070      0.0000
75    500    0.0223     0.00067   0.0050      0.0000
75    750    0.0197     0.00047   0.0047      0.0000
75    1000   0.0276     0.00058   0.0055      0.0000
100   500    0.0185     0.00040   0.0047      0.0000
100   750    0.0215     0.00055   0.0050      0.0000
100   1000   0.0288     0.00044   0.0056      0.0000
emulated brute force. Finally, we note that other results show that for $n = 30$, $S = 200$, and $T = 2000$, our algorithm takes about 4.5 h/example to run, while brute force takes about 2.8 h/example. Thus we expect our algorithm to run faster than brute force at about $n = 31$ or $n = 32$.
6. Conclusions and future work
MWU algorithms are particularly useful when the number of inputs is very large, since their mistake bounds are logarithmic in the total number of inputs. However, only in special cases is it known how to compute the weighted sum of the inputs efficiently enough to exploit this attribute efficiency. We presented a general, widely applicable method based on Markov chain Monte Carlo for estimating these weighted sums, along with theoretical and empirical analyses of the method as applied to learning DNF formulas and pruning ensembles of classifiers. Our theoretical results do not yield efficient algorithms for these problems, but they do provide machinery for potentially conducting average-case analyses of these algorithms on, e.g., restricted classes of DNF and/or restricted distributions. Further, as a heuristic, our methods show promise: in experiments on simulated data, our algorithms perform much better than the worst-case theoretical results imply, especially considering that highly accurate estimates of the weighted sums are not needed so long as we know which side of the threshold the sum lies on. Also, our approach can very easily be implemented on a parallel or clustered architecture, since all samples are drawn independently of each other, and each term in Eq. (1) can be computed independently of the other terms.
speed it up further. They conducted tests using data sets from the UCI Repository [1] andincluded experimenting with other sampling methods besides the Metropolis sampler [27] ofSection 3. They also utilized methods that stop sampling early when it is known what side of thethreshold the weighted sum will fall.It is open whether better mixing time bounds are possible for special cases of the DNF problem
of Section 4.1. For example, if one considers learning parity functions (or other classes offunctions) under the uniform distribution, can the bounds of Lemmas 12 and 13 be tightened tosub-exponential? Alternatively, are there special cases where Winnow with linear pð~xxÞ (Section4.1.1.3) has on average a sufficiently small mistake bound to make the mixing time polynomial?It should be possible to apply our results to other problems for which an MWU algorithm is
applicable with an exponential number of inputs. The key is to map the set of inputs to ahypercube (perhaps truncated), and use that space as the set of states for the Markov chain. Onesuch application would be to accelerate Winnow-based algorithms [13,37,42] for a learning modelthat generalizes the conventional multiple-instance learning model [9]. Other applications mightinclude using the Perceptron algorithm on problems where no kernel is available to exactlycompute the weighted sums.When Morris and Sinclair [28] solved the knapsack problem, they also generalized their result
to a hypercube truncated by multiple hyperplanes (though the number of hyperplanes must be constant), rather than the single hyperplane that we consider in Section 4.2. Since the stationary distribution of their chain was assumed uniform, a natural question to ask is whether their results
can be generalized to non-uniform distributions, and whether there are applications of this generalization to learning problems.

There is also the question of how to choose S and T for empirical use so as to balance time
complexity and precision. While it is important to estimate the weighted sums accurately in order to properly simulate WM and Winnow, some imperfection in the simulation can be tolerated, since incorrect simulation decisions can be treated as noise, which Winnow and WM handle. Ideally, the algorithms would intelligently choose S and T based on past performance, perhaps (for Winnow) exploiting the upper bound of αθ on all weights in a brute-force execution (since no promotions can occur past that point): thus ∀p⃗, z(p⃗) ≤ 1 + ⌈log_α θ⌉. If this bound is exceeded during a run of Winnow, then one can increase S and T and run again.
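The sampling-based estimation discussed throughout can be sketched in a few lines. The following is a minimal illustration, not the exact procedure of Section 3: it assumes hypothetical product-form weights w(x) = Π_i a_i^{x_i} over the hypercube {0,1}^n (a toy choice, made so that the exact total Π_i (1 + a_i) is available for checking), and estimates the total weight as 2^n times a telescoping product of ratio estimates, each obtained with a single-bit-flip Metropolis sampler [27]. Each ratio (cf. the terms of Eq. (1)) can be estimated independently of the others, which is what makes the parallel implementation mentioned above straightforward.

```python
import math
import random

random.seed(0)

# Hypothetical product-form weights w(x) = prod_i a_i^{x_i}, x in {0,1}^n,
# standing in for the exponentially many inputs.  The exact total weight is
# then prod_i (1 + a_i), which lets us check the MCMC estimate.
n = 8
a = [0.5 + 1.5 * random.random() for _ in range(n)]
exact = 1.0
for ai in a:
    exact *= 1.0 + ai

def metropolis_step(x, beta):
    """One single-bit-flip Metropolis step targeting pi_beta(x) ~ prod_i a_i^(beta*x_i)."""
    i = random.randrange(n)
    # Acceptance ratio for flipping bit i: a_i^beta if 0->1, a_i^(-beta) if 1->0.
    ratio = a[i] ** (beta if x[i] == 0 else -beta)
    if random.random() < min(1.0, ratio):
        x[i] ^= 1

def estimate_total_weight(num_temps=10, burn=200, samples=200, thin=5):
    """Telescoping product: Z_1 = 2^n * prod_k E_{pi_{b_k}}[ w(x)^(b_{k+1}-b_k) ]."""
    betas = [k / num_temps for k in range(num_temps + 1)]
    est = float(2 ** n)        # Z_0 = 2^n: beta = 0 gives the uniform distribution
    x = [0] * n
    for k in range(num_temps):
        db = betas[k + 1] - betas[k]
        for _ in range(burn):                 # let the chain approach pi_{b_k}
            metropolis_step(x, betas[k])
        acc = 0.0
        for _ in range(samples):
            for _ in range(thin):             # thin to reduce sample correlation
                metropolis_step(x, betas[k])
            # w_{b_{k+1}}(x) / w_{b_k}(x) = prod over set bits of a_i^db
            acc += math.prod(a[i] ** db for i in range(n) if x[i])
        est *= acc / samples                  # one independently estimable ratio
    return est

est = estimate_total_weight()
print(f"estimate = {est:.1f}, exact = {exact:.1f}")
```

Here `samples` plays the role of S and the chain lengths (`burn`, `thin`) the role of T; increasing them trades time for precision, roughly mirroring the S/T trade-off discussed above.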
Acknowledgments
The authors thank Mark Jerrum and Alistair Sinclair for their discussions; Jeff Jackson, Qingping Tao, and the COLT and JCSS reviewers for their helpful comments; and Jeff Jackson for presenting an earlier version of this paper at COLT. This work was supported in part by NSF Grants CCR-0092761 and CCR-9877080, with matching funds from UNL-CCIS and a Layman grant, and was completed in part utilizing the Research Computing Facility of the University of Nebraska. Deepak Chawla performed this work at the University of Nebraska.
References
[1] C. Blake, E. Keogh, C.J. Merz, UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, 2003.
[2] A. Blum, P. Chalasani, J. Jackson, On learning embedded symmetric concepts, in: Proceedings of the Sixth Annual
Workshop on Computational Learning Theory, ACM Press, New York, NY, 1993, pp. 337–346.
[3] A. Blum, M. Furst, J. Jackson, M. Kearns, Y. Mansour, S. Rudich, Weakly learning DNF and characterizing
statistical query learning using Fourier analysis, in: Proceedings of 26th ACM Symposium on Theory of
Computing, 1994, pp. 253–262.
[4] N.H. Bshouty, Simple learning algorithms using divide and conquer, Comput. Complexity 6 (2) (1997) 174–194.
[5] N.H. Bshouty, J. Jackson, C. Tamon, More efficient PAC-learning of DNF with membership queries under the
uniform distribution, J. Comput. System Sci., to appear (early version in COLT 99).
[6] N. Cesa-Bianchi, Y. Freund, D. Helmbold, D. Haussler, R. Schapire, M. Warmuth, How to use expert advice,
J. ACM 44 (3) (1997) 427–485.
[7] D. Chawla, L. Li, S.D. Scott, Efficiently approximating weighted sums with exponentially many terms, in:
Proceedings of the 14th Annual Conference on Computational Learning Theory, 2001, pp. 82–98.
[8] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning
Methods, Cambridge University Press, Cambridge, 2000.
[9] T.G. Dietterich, R.H. Lathrop, T. Lozano-Perez, Solving the multiple-instance problem with axis-parallel
rectangles, Artificial Intelligence 89 (1–2) (1997) 31–71.
[10] M. Dyer, A. Frieze, R. Kannan, A. Kapoor, U. Vazirani, A mildly exponential time algorithm for approximating
the number of solutions to a multidimensional knapsack problem, Combin. Probab. Comput. 2 (1993) 271–284.
[11] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting,
J. Comput. System Sci. 55 (1) (1997) 119–139.
[12] C. Gentile, M.K. Warmuth, Linear hinge loss and average margin, in: M.S. Kearns, S.A. Solla, D.A. Cohn (Eds.),
Advances in Neural Information Processing Systems, Vol. 11, MIT Press, Cambridge, MA, 1998, pp. 225–231.
[13] S.A. Goldman, S.K. Kwek, S.D. Scott, Agnostic learning of geometric patterns, J. Comput. System Sci. 6 (1)
(2001) 123–151.
[14] S.A. Goldman, S.D. Scott, Multiple-instance learning of real-valued geometric patterns, Ann. Math. Artificial
Intelligence 39 (3) (2003) 259–290.
[15] D.P. Helmbold, R.E. Schapire, Predicting nearly as well as the best pruning of a decision tree, Mach. Learning 27
(1) (1997) 51–68.
[16] M. Jerrum, A. Sinclair, The Markov chain Monte Carlo method: an approach to approximate counting and
integration, in: D. Hochbaum (Ed.), Approximation Algorithms for NP-Hard Problems, PWS Publishing, Boston,
MA, 1996, pp. 482–520 (Chapter 12).
[17] M.R. Jerrum, L.G. Valiant, V.V. Vazirani, Random generation of combinatorial structures from a uniform
distribution, Theoret. Comput. Sci. 43 (1986) 169–188.
[18] R. Khardon, D. Roth, R. Servedio, Efficiency versus convergence of Boolean kernels for online learning
algorithms, in: T.G. Dietterich, S. Becker, Z. Ghahramani (Eds.), Advances in Neural Information Processing
Systems, Vol. 14, 2001, MIT Press, Cambridge, MA, pp. 423–430.
[19] S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Optimization by simulated annealing, Science 220 (1983) 671–680.
[20] J. Kivinen, M.K. Warmuth, P. Auer, The perceptron algorithm vs. Winnow: linear vs. logarithmic mistake bounds
when few input variables are relevant, Artificial Intelligence 97 (1–2) (1997) 325–343.
[21] N. Littlestone, Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm, Mach.
Learning 2 (1988) 285–318.
[22] N. Littlestone, From on-line to batch learning, in: Proceedings of the Second Annual Workshop on
Computational Learning Theory, Morgan Kaufmann, Los Altos, CA, 1989, pp. 269–284.
[23] N. Littlestone, Redundant noisy attributes, attribute errors, and linear threshold learning using Winnow, in:
Proceedings of the Fourth Annual Workshop on Computational Learning Theory, Morgan Kaufmann, San
Mateo, CA, 1991, pp. 147–156.
[24] N. Littlestone, M.K. Warmuth, The weighted majority algorithm, Inform. and Comput. 108 (2) (1994) 212–261.
[25] W. Maass, M.K. Warmuth, Efficient learning with virtual threshold gates, Inform. Comput. 141 (1) (1998) 66–83.
[26] D.D. Margineantu, T.G. Dietterich, Pruning adaptive boosting, in: Proceedings of the 14th International
Conference on Machine Learning, Morgan Kaufmann, Los Altos, CA, 1997, pp. 211–218.
[27] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, Equation of state calculation by fast
computing machines, J. Chem. Phys. 21 (1953) 1087–1092.
[28] B. Morris, A. Sinclair, Random walks on truncated cubes and sampling 0–1 knapsack solutions, SIAM J.
Comput., to appear (early version in FOCS 99).
[29] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf, An introduction to kernel-based learning methods,
IEEE Trans. Neural Networks 12 (2) (2001) 181–201.
[30] F. Pereira, Y. Singer, An efficient extension to mixture techniques for prediction and decision trees, Mach.
Learning 36 (3) (1999) 183–199.
[31] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, CA, 1993.
[32] J.R. Quinlan, Bagging, boosting, and C4.5, in: Proceedings of the 13th National Conference on Artificial
Intelligence, 1996, pp. 725–730.
[33] F. Rosenblatt, The perceptron: a probabilistic model for information storage and organization in the brain, Psych.
Rev. 65 (1958) 386–407 (Reprinted in Neurocomputing, MIT Press, Cambridge, MA, 1988).
[34] R.E. Schapire, Y. Freund, P. Bartlett, W.S. Lee, Boosting the margin: a new explanation for the effectiveness of
voting methods, Ann. Statist. 26 (5) (1998) 1651–1686.
[35] R.E. Schapire, Y. Singer, Improved boosting algorithms using confidence-rated predictions, Mach. Learning 37 (3)
(1999) 297–336.
[36] B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and
Beyond, MIT Press, Cambridge, MA, 2002.
[37] S.D. Scott, J. Zhang, J. Brown, On generalized multiple-instance learning, Technical Report UNL-CSE-2003-5,
Dept. of Computer Science, University of Nebraska, 2003.
[38] A. Sinclair, Improved bounds for mixing rates of Markov chains and multicommodity flow, Combin. Probab.
Comput. 1 (1992) 351–370.
[39] E. Takimoto, M. Warmuth, Predicting nearly as well as the best pruning of a planar decision graph, Theoret.
Comput. Sci. 288 (2) (2002) 217–235.
[40] C. Tamon, J. Xiang, On the boosting pruning problem, in: Proceedings of the 11th European Conference on
Machine Learning, Springer, Berlin, 2000, pp. 404–412.
[41] Q. Tao, S. Scott, An analysis of MCMC sampling methods for estimating weighted sums in Winnow,
in: C.H. Dagli (Ed.), Artificial Neural Networks in Engineering, ASME Press, Fairfield, NJ, 2003, pp. 15–20.
[42] Q. Tao, S.D. Scott, A faster algorithm for generalized multiple-instance learning, in: Proceedings of the
Seventeenth Annual FLAIRS Conference, AAAI Press, Miami Beach, FL, 2004, to appear.