
Decision Trees: More Theoretical Justification for Practical Algorithms (Extended Abstract)

    Amos Fiat and Dmitry Pechyony

School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel {fiat,pechyony}@tau.ac.il

Abstract. We study impurity-based decision tree algorithms such as CART, C4.5, etc., so as to better understand their theoretical underpinnings. We consider such algorithms on special forms of functions and distributions. We deal with the uniform distribution and functions that can be described as unate functions, linear threshold functions and read-once DNF. For unate functions we show that maximal purity gain and maximal influence are logically equivalent. This leads us to the exact identification of unate functions by impurity-based algorithms given sufficiently many noise-free examples. We show that for this class of functions these algorithms build minimal height decision trees. Then we show that if the unate function is a read-once DNF or a linear threshold function then the decision tree resulting from these algorithms has the minimal number of nodes amongst all decision trees representing the function. Based on the statistical query learning model, we introduce a noise-tolerant version of practical decision tree algorithms. We show that when the input examples have small classification noise and are uniformly distributed, all our results for practical noise-free impurity-based algorithms also hold for their noise-tolerant versions.

    1 Introduction

Introduced in 1983 by Breiman et al. [3], decision trees are one of the few knowledge representation schemes which are easily interpreted and may be inferred by very simple learning algorithms. The practical usage of decision trees is enormous (see [21] for a detailed survey). The most popular practical decision tree algorithms are CART ([3]), C4.5 ([22]) and their various modifications. The heart of these algorithms is the choice of splitting variables according to maximal purity gain value. To compute this value these algorithms use various impurity functions. For example, CART employs the Gini index impurity function and C4.5 uses an impurity function based on entropy. We refer to this family of algorithms as impurity-based.

The full version of the paper, containing all proofs, can be found online at http://www.cs.tau.ac.il/pechyony/dt full.ps

Dmitry Pechyony is a full-time student and thus this paper is eligible for the Best Student Paper award according to conference regulations.


Despite their practical success, the most commonly used algorithms and systems for building decision trees lack a strong theoretical basis. It would be interesting to obtain bounds on the generalization error and on the size of the decision trees resulting from these algorithms, given some predefined number of examples.

1.1 Theoretical Justification of Practical Decision Tree Building Algorithms

There have been several results justifying practical decision tree building algorithms theoretically. Kearns and Mansour showed in [16] that if the function used for labelling the nodes of the tree is a weak approximator of the target function, then the impurity-based algorithms for building decision trees using the Gini index, entropy or the new index are boosting algorithms. This property ensures distribution-free PAC learning and arbitrarily small generalization error given sufficiently many input examples. This work was recently extended by Takimoto and Maruoka [23] to functions having more than two values and by Kalai and Servedio [14] to noisy examples.

We restrict ourselves to the input of uniformly distributed examples. We provide new insight into practical impurity-based decision tree algorithms by showing that for unate boolean functions, the choice of splitting variable according to maximal exact purity gain is equivalent to the choice of the variable with maximal influence. Then we introduce the algorithm DTExactPG, which is a modification of impurity-based algorithms that uses exact probabilities and purity gain rather than estimates. The main results of our work are stated in the following theorems (assuming f is unate):

Theorem 1 The algorithm DTExactPG builds a decision tree representing f(x) and having minimal height amongst all decision trees representing f(x). If f(x) is a boolean linear threshold function or a read-once DNF, then the tree built by the algorithm has minimal size amongst all decision trees representing f(x).

Theorem 2 Let h be the minimal depth of a decision tree representing f(x). For any δ > 0, given O(2^{9h} ln^2(1/δ)) = poly(2^h, ln(1/δ)) uniformly distributed noise-free random examples of f(x), with probability at least 1 - δ, CART and C4.5 build a decision tree computing f(x) exactly. The tree produced has minimal height amongst all decision trees representing f(x). If f(x) is a read-once DNF or a boolean linear threshold function then the resulting tree has the minimal number of nodes amongst all decision trees representing f(x).

In case the input examples have classification noise with rate η < 1/2, we introduce a noise-tolerant version of impurity-based algorithms and obtain the same results as for the noise-free case:

Theorem 3 For any δ > 0, given O(2^{9h} ln^2(1/δ)) = poly(2^h, ln(1/δ)) uniformly distributed random examples of f(x) corrupted by classification noise with constant rate η, with probability at least 1 - δ, a noise-tolerant version of impurity-based algorithms builds a decision tree representing f(x). The tree produced has minimal height amongst all decision trees representing f(x). If f(x) is a read-once DNF or a boolean linear threshold function then the resulting tree has the minimal number of nodes amongst all decision trees representing f(x).


Function | Exact Influence | Exact Purity Gain | CART, C4.5, etc., poly(2^h) uniform noise-free examples | Modification of CART, C4.5, etc., poly(2^h) uniform examples with small classification noise
Unate | min height | min height | min height | min height
Boolean LTF | min size | min size | min size | min size
Read-once DNF | min size | min size | min size | min size

Fig. 1. Summary of bounds on the size of decision trees obtained in our work.

Algorithm | Model, Distribution | Running Time | Hypothesis | Bounds on the Size of DT | Function Learned
Jackson and Servedio [13] | PAC, uniform | poly(2^h) | Decision Tree | none | almost any DNF
Impurity-Based Algorithms (Kearns and Mansour [16]) | PAC, any | poly((1/ε)^{c/γ^2}) | Decision Tree | none | any function satisfying the Weak Hypothesis Assumption
Bshouty and Burroughs [4] | PAC, any | poly(2^n) | Decision Tree | at most the min-sized DT representing the function | any
Kushilevitz and Mansour [18], Bshouty and Feldman [5], Bshouty et al. [6] | PAC, examples from uniform random walk | poly(2^h) | Fourier Series | N/A | any
Impurity-Based Algorithms (our work) | PC (exact identification), uniform | poly(2^h) | Decision Tree | minimal height (unate); minimal size (read-once DNF, boolean LTF) | unate, read-once DNF, boolean LTF

Fig. 2. Summary of decision tree noise-free learning algorithms.

Figure 1 summarizes the bounds on the size of decision trees obtained in our work.

    1.2 Previous Work

Building a decision tree of minimal height or with a minimal number of nodes, consistent with all given examples, is NP-hard ([12]). The single polynomial-time deterministic approximation algorithm known today for approximating the height of decision trees is the simple greedy algorithm ([20]), achieving a factor of O(ln(m)) (m is the number of input examples). Combining the results of [11] and [8] it can be shown that the depth of a decision tree cannot be approximated within a factor of (1 - ε) ln(m) unless NP ⊆ DTIME(n^{O(log log n)}). Hancock et al. showed in [10] that the problem of building a decision tree with a minimal number of nodes cannot be approximated within a factor of 2^{log^δ OPT} for any δ < 1, unless NP ⊆ RTIME[2^{poly log n}].

Blum et al. showed in [2] that decision trees cannot even be weakly learned in polynomial time from statistical queries dealing with uniformly distributed examples. Thus, there is no modification of existing decision tree learning algorithms yielding efficient polynomial-time statistical query learning algorithms for arbitrary functions. This result is evidence for the difficulty of weak learning (and thus also PAC learning) of decision trees of arbitrary functions in the noise-free and noisy settings.

Figure 2 summarizes the best results obtained by theoretical algorithms for learning decision trees from noise-free examples. Most of them may be modified to obtain corresponding noise-tolerant versions.

Kearns and Valiant ([17]) proved that distribution-free weak learning of read-once DNF using any representation is equivalent to several cryptographic problems widely believed to be hard. Mansour and Schain give in [19] an algorithm for proper PAC learning of read-once DNF in polynomial time from random examples taken from any maximum entropy distribution. This algorithm may easily be modified to obtain polynomial-time probably correct learning in case the underlying function has a decision tree of logarithmic depth and the input examples are uniformly distributed, matching the performance of our algorithm in this case. Using both membership and equivalence queries, Angluin et al. showed in [1] a polynomial-time algorithm for exact identification of read-once DNF by read-once DNF using examples taken from any distribution.

    Boolean linear threshold functions are polynomially properly PAC learnablefrom both noise-free examples (folk result) and examples with small classificationnoise ([7]). In both cases the examples may be taken from any distribution.

    1.3 Structure of the Paper

In Section 2 we give the relevant definitions. In Section 3 we introduce a new algorithm, DTInfluence, for building decision trees using an oracle for influence and prove several properties of the resulting decision trees. In Section 4 we prove Theorem 1. In Section 5 we prove Theorem 2. In Section 6 we introduce the noise-tolerant version of impurity-based algorithms and prove Theorem 3. In Section 7 we outline directions for further research.

    2 Background

In this paper we use the standard definitions of the PAC ([24]) and statistical query ([15]) learning models. All our results are in the PAC model with zero generalization error. We denote this model by PC (Probably Correct).


    2.1 Boolean Functions

A boolean function (concept) is defined as f : {0, 1}^n → {0, 1} (for boolean formulas, e.g. read-once DNF) or as f : {-1, 1}^n → {0, 1} (for arithmetic formulas, e.g. boolean linear threshold functions). Let xi be the i-th variable or attribute. Let x = (x1, . . . , xn), and let f(x) be the target or classification. The vector (x1, x2, . . . , xn, f(x)) is called an example. Let fxi=a(x), a ∈ {0, 1}, be the function f(x) restricted to xi = a. We refer to the assignment xi = a as a restriction. Given the set of restrictions R = {xi1 = a1, . . . , xik = ak}, the restricted function fR(x) is defined similarly. xi ∈ R iff there exists a restriction xi = a in R, where a is any value.
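For concreteness, the following short Python sketch (ours, not code from the paper; the helper name restrict is illustrative) represents a boolean function as a callable over {0,1}^n and builds a restricted function fR:

from itertools import product

def restrict(f, restrictions):
    """Return the restricted function f_R obtained by fixing x_i = a for every
    (i, a) in `restrictions` (variable indices are 1-based, as in the paper)."""
    def f_R(x):
        y = list(x)
        for i, a in restrictions.items():
            y[i - 1] = a
        return f(tuple(y))
    return f_R

# f(x) = x1 v x2*x3, written as a Python callable over {0,1}^3
f = lambda x: int(x[0] or (x[1] and x[2]))

f_R = restrict(f, {1: 1})                             # the restriction x1 = 1
print([f_R(x) for x in product([0, 1], repeat=3)])    # f_R is the constant function 1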

A literal x̃i is a boolean variable xi itself or its negation x̄i. A term is a conjunction of literals and a DNF (Disjunctive Normal Form) formula is a disjunction of terms. Let |F| be the number of terms in the DNF formula F and |ti| be the number of literals in the term ti. Essentially F is a set of terms F = {t1, . . . , t|F|} and ti is a set of literals, ti = {x̃i1, . . . , x̃i|ti|}. The term ti is satisfied iff x̃i1 = . . . = x̃i|ti| = 1.

If for all 1 ≤ i ≤ n, f(x) is monotone w.r.t. xi or x̄i, then f(x) is a unate function. A DNF is read-once if each variable appears in it at most once. Given a weight vector a = (a1, . . . , an), such that ai ∈ ℝ for all 1 ≤ i ≤ n, and a threshold t ∈ ℝ, the boolean linear threshold function (LTF) fa,t is defined by fa,t(x) = 1 iff Σ_{i=1}^{n} ai·xi > t. Let ei be the vector of n components containing 1 in the i-th component and 0 in all other components. The influence of xi on f(x) under distribution D is If(i) = Pr_{x∼D}[f(x) ≠ f(x ⊕ ei)]. We use the notion of an influence oracle as an auxiliary tool. The influence oracle runs in time O(1) and returns the exact value of If(i) for any f and i.
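For small n the influence under the uniform distribution can be computed by brute force. The sketch below (ours; the LTF coefficients are an arbitrary illustration, and it uses the {0,1}^n encoding for simplicity even though the paper states LTFs over {-1,1}^n):

from itertools import product

def influence(f, n, i):
    """I_f(i) = Pr_x[f(x) != f(x xor e_i)] under the uniform distribution on {0,1}^n."""
    total = 0
    for x in product([0, 1], repeat=n):
        y = list(x)
        y[i] = 1 - y[i]                         # flip the i-th coordinate (x xor e_i)
        total += f(x) != f(tuple(y))
    return total / 2 ** n

# An arbitrary boolean linear threshold function: f(x) = 1 iff 3*x1 + 2*x2 + x3 > 2
a, t = (3, 2, 1), 2
ltf = lambda x: int(sum(ai * xi for ai, xi in zip(a, x)) > t)

print([influence(ltf, 3, i) for i in range(3)])  # prints [0.75, 0.25, 0.25]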

    2.2 Decision Trees

In our work we restrict ourselves to binary univariate decision trees for boolean functions, so the definitions given below are adjusted to this model and are not generic. A decision tree T is a rooted DAG consisting of nodes and leaves. Each node in T, except the root, has in-degree 1 and out-degree 2. The in-degree of the root is 0. Each leaf has in-degree 1 and out-degree 0. The edges of T and its leaves are labelled with 1 and 0.

The classification of an input x is done by traversing the tree from the root to some leaf. Every node s of T contains a test "xi = 1?" and the variable xi is called a splitting variable. The left (right) son of s is also called the 0-son (1-son) and is referred to as s0 (s1). Let c(l) be the label of the leaf l. Upon arriving at the node s, the input x is passed to the xi-son of s, i.e. to s1 if xi = 1 and to s0 otherwise. The classification given to the input x by T is denoted by cT(x). The path from the root to a node s corresponds to the set of restrictions of values of variables leading to s. Similarly, the node s corresponds to the restricted function fR(x). In the sequel we use the identifier s of the node and its corresponding restricted function interchangeably.


[Figure: a decision tree whose root tests x1 = 1?; its 1-son tests x3 = 1? and its 0-son tests x2 = 1?; each of these two nodes has a 1-leaf and a 0-leaf.]

Fig. 3. Example of the decision tree representing f(x) = x1x3 ∨ x̄1x2.

DTApproxPG(s, X, R, φ)
1: if all examples arriving at s have the same classification then
2:   Set s as a leaf with that value.
3: else
4:   Choose xi = arg max_{xi ∈ X} {P̂G(fR, xi, φ)} to be a splitting variable.
5:   Run DTApproxPG(s1, X - {xi}, R ∪ {xi = 1}, φ).
6:   Run DTApproxPG(s0, X - {xi}, R ∪ {xi = 0}, φ).
7: end if

Fig. 4. The DTApproxPG algorithm: the generic structure of all impurity-based algorithms.

The height of T, h(T), is the maximal length of a path from the root to any node. The size of T, |T|, is the number of nodes in T. A decision tree T represents f(x) iff f(x) = cT(x) for all x. An example of a decision tree representing the function f(x) = x1x3 ∨ x̄1x2 is shown in Fig. 3.
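The following minimal sketch (ours) encodes the tree of Fig. 3 and the classification cT(x) by root-to-leaf traversal; the tuple-based node representation is purely illustrative:

from itertools import product

# A node is ("node", i, son0, son1) testing "x_i = 1?"; a leaf is ("leaf", label).
def classify(tree, x):
    """c_T(x): traverse from the root, following the x_i-son at every inner node."""
    while tree[0] == "node":
        _, i, son0, son1 = tree
        tree = son1 if x[i - 1] == 1 else son0
    return tree[1]

# The tree of Fig. 3: the root tests x1; its 1-son tests x3, its 0-son tests x2.
T = ("node", 1,
     ("node", 2, ("leaf", 0), ("leaf", 1)),   # x1 = 0: test x2
     ("node", 3, ("leaf", 0), ("leaf", 1)))   # x1 = 1: test x3

f = lambda x: int((x[0] and x[2]) or ((not x[0]) and x[1]))
assert all(classify(T, x) == f(x) for x in product([0, 1], repeat=3))  # T represents f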

The function φ(x), defined on [0, 1], is an impurity function if it is concave, φ(x) = φ(1 - x) for any x ∈ [0, 1], and φ(0) = φ(1) = 0. Examples of impurity functions are the Gini index φ(x) = 4x(1 - x) ([3]), the entropy function φ(x) = -x log x - (1 - x) log(1 - x) ([22]) and the new index φ(x) = 2√(x(1 - x)) ([16]). Let sa(i), a ∈ {0, 1}, denote the a-son of s that would be created if xi were placed at s as a splitting variable. For each node s let Pr[sa(i)], a ∈ {0, 1}, denote the probability that a random example from the uniform distribution arrives at sa(i) given that it has already arrived at s. Let p(s) be the probability that an example arriving at the node s is positive. The impurity sum (IS) of xi at s using the impurity function φ(x) is IS(s, xi, φ) = Pr[s0(i)]·φ(p(s0(i))) + Pr[s1(i)]·φ(p(s1(i))). The purity gain (PG) of xi at s is PG(s, xi, φ) = φ(p(s)) - IS(s, xi, φ). The estimated values of all these quantities are P̂G, ÎS, etc.

Figure 4 gives the structure of all impurity-based algorithms. The algorithm takes four parameters: s, identifying the current tree node; X, standing for the set of attributes available for testing; R, which is the set of the function's restrictions leading to s; and φ, identifying the impurity function. Initially s is set to the root node, X contains all attribute variables and R is an empty set.

Since the value of φ(p(s)) is attribute-independent, the choice of maximal PG(s, xi, φ) is equivalent to the choice of minimal IS(s, xi, φ). For uniformly distributed examples, Pr[s0(i)] = Pr[s1(i)] = 0.5. Thus if the impurity sum is computed exactly, then φ(p(s0(i))) and φ(p(s1(i))) have equal weight. We define the balanced impurity sum of xi at s as BIS(s, xi, φ) = φ(p(s0(i))) + φ(p(s1(i))).
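Under the uniform distribution these quantities can be computed exactly by enumeration for small n. The sketch below (ours; the function purity_gain and the example f are illustrative) computes PG and BIS for each candidate split at a node given its set of restrictions R:

from itertools import product
from math import log2, sqrt

gini    = lambda p: 4 * p * (1 - p)
entropy = lambda p: 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)
new_idx = lambda p: 2 * sqrt(p * (1 - p))

def node_stats(f, n, R):
    """Examples reaching s are those consistent with the restrictions R = {i: a}."""
    xs = [x for x in product([0, 1], repeat=n) if all(x[i - 1] == a for i, a in R.items())]
    return xs, sum(f(x) for x in xs) / len(xs)        # p(s)

def purity_gain(f, n, R, i, phi):
    """PG(s, x_i, phi) = phi(p(s)) - [Pr[s0]*phi(p(s0)) + Pr[s1]*phi(p(s1))]; also BIS."""
    _, p_s = node_stats(f, n, R)
    p = {a: node_stats(f, n, {**R, i: a})[1] for a in (0, 1)}
    impurity_sum = 0.5 * phi(p[0]) + 0.5 * phi(p[1])  # Pr[s0(i)] = Pr[s1(i)] = 0.5 (uniform)
    balanced_is  = phi(p[0]) + phi(p[1])              # BIS(s, x_i, phi)
    return phi(p_s) - impurity_sum, balanced_is

# f(x) = x1 v x2*x3 (a unate function), split candidates at the root (R empty):
f = lambda x: int(x[0] or (x[1] and x[2]))
for i in (1, 2, 3):
    pg, bis = purity_gain(f, 3, {}, i, gini)
    print(f"x{i}: PG = {pg:.4f}, BIS = {bis:.4f}")
# x1 (the most influential variable) gets the largest purity gain and the smallest BIS.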


DTInfluence(s, X, R)
1: if for all xi ∈ X, IfR(i) = 0 then
2:   Set the classification of s as the classification of any example arriving at it.
3: else
4:   Choose xi = arg max_{xi ∈ X} {IfR(i)} to be a splitting variable.
5:   Run DTInfluence(s1, X - {xi}, R ∪ {xi = 1}).
6:   Run DTInfluence(s0, X - {xi}, R ∪ {xi = 0}).
7: end if

Fig. 5. The DTInfluence algorithm.

    3 Building Decision Trees Using an Influence Oracle

In this section we introduce a new algorithm, DTInfluence (see Fig. 5), for building decision trees using an influence oracle. This algorithm greedily chooses the splitting variable with maximal influence. Clearly, the resulting tree consists only of relevant variables. The algorithm takes three parameters, s, X and R, having the same meaning and initial values as in the algorithm DTApproxPG.
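A brute-force Python sketch of this greedy procedure (ours; the influence oracle is simulated by enumeration, which is feasible only for small n):

from itertools import product

def influence(f, n, i, R):
    """Influence of x_i on the restricted function f_R under the uniform distribution."""
    diff = 0
    for x in product([0, 1], repeat=n):
        y = {**dict(enumerate(x, start=1)), **R}      # apply the restrictions R
        z = dict(y); z[i] = 1 - z[i]                  # flip x_i
        diff += (f(tuple(y[j] for j in range(1, n + 1)))
                 != f(tuple(z[j] for j in range(1, n + 1))))
    return diff / 2 ** n

def dt_influence(f, n, X, R):
    """DTInfluence: split on the variable of maximal influence; stop when all are 0."""
    infl = {i: influence(f, n, i, R) for i in X}
    if not X or max(infl.values()) == 0:              # f_R is constant
        x0 = tuple(R.get(j, 0) for j in range(1, n + 1))
        return ("leaf", f(x0))
    i = max(infl, key=infl.get)
    return ("node", i,
            dt_influence(f, n, X - {i}, {**R, i: 0}),   # 0-son
            dt_influence(f, n, X - {i}, {**R, i: 1}))   # 1-son

f = lambda x: int(x[0] or (x[1] and x[2]))            # a unate (monotone) function
print(dt_influence(f, 3, {1, 2, 3}, {}))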

Lemma 1 Let f(x) be any boolean function. Then the decision tree T built by the algorithm DTInfluence represents f(x) and has no node such that all examples arriving at it have the same classification.

    Proof See online full version ([9]).

Lemma 2 If f(x) is a unate function with n relevant variables then any decision tree representing f(x) and consisting only of relevant variables has height n.

    Proof See online full version ([9]).

Corollary 3 If f(x) is a unate function then the algorithm DTInfluence produces a minimal height decision tree representing f(x).

    Proof Follows directly from Lemma 2.

    3.1 Read-Once DNF

Let f(x) be a boolean function which can be represented by a read-once DNF F. In this section we prove the following lemma:

Lemma 4 For any f(x) which can be represented by a read-once DNF, the decision tree built by the algorithm DTInfluence has the minimal number of nodes amongst all decision trees representing f(x).

The proof of Lemma 4 consists of two parts. In the first part of the proof we introduce another algorithm, called DTMinTerm (see Fig. 6). Then we prove Lemma 4 for the algorithm DTMinTerm. In the second part of the proof we show that the trees built by DTMinTerm and DTInfluence are the same.


DTMinTerm(s, F)
1: if ∃ ti ∈ F such that ti = ∅ then
2:   Set s as a positive leaf.
3: else
4:   if F = ∅ then
5:     Set s as a negative leaf.
6:   else
7:     Let tmin = arg min_{ti ∈ F} {|ti|}, tmin = {x̃m1, x̃m2, . . . , x̃m|tmin|}.
8:     Choose any x̃mi ∈ tmin. Let t'min = tmin \ {x̃mi}.
9:     if x̃mi = xmi then
10:      Run DTMinTerm(s1, F \ {tmin} ∪ {t'min}), DTMinTerm(s0, F \ {tmin}).
11:    else
12:      Run DTMinTerm(s0, F \ {tmin} ∪ {t'min}), DTMinTerm(s1, F \ {tmin}).
13:    end if
14:  end if
15: end if

Fig. 6. The DTMinTerm algorithm.

Assume we are given a read-once DNF formula F. We change the algorithm DTInfluence so that the splitting rule is to choose any variable xi in the smallest term tj ∈ F. The algorithm stops when the restricted function becomes constant (true or false). The new algorithm, denoted by DTMinTerm, is shown in Figure 6. The initial value of the first parameter of the algorithm is the same as in DTInfluence, and the second parameter is initially set to the function's DNF formula F.
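A Python sketch of DTMinTerm (ours; a read-once DNF is encoded as a list of terms, each term a set of signed integers, +i for the literal xi and -i for its negation):

def dt_min_term(F):
    """Build a decision tree for the read-once DNF F by always splitting on a
    variable of a shortest remaining term (cf. Fig. 6). Valid only for read-once
    DNF, since removing a variable then affects at most one term."""
    if any(len(t) == 0 for t in F):          # an empty term is satisfied: f_R = 1
        return ("leaf", 1)
    if not F:                                # no terms left: f_R = 0
        return ("leaf", 0)
    t_min = min(F, key=len)
    lit = next(iter(t_min))                  # choose any literal of the shortest term
    i = abs(lit)
    rest   = [t for t in F if t is not t_min]
    shrunk = rest + [t_min - {lit}]          # branch where the chosen literal is satisfied
    satisfied_side = 1 if lit > 0 else 0     # x_i = 1 satisfies a positive literal
    sons = {satisfied_side: dt_min_term(shrunk),
            1 - satisfied_side: dt_min_term(rest)}
    return ("node", i, sons[0], sons[1])

F = [{1, 2}, {3}]                            # f = x1*x2 v x3 (read-once)
print(dt_min_term(F))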

The following three lemmata are proved in the online full version ([9]).

Lemma 5 Given the read-once DNF formula F representing the function f(x), the decision tree T built by the algorithm DTMinTerm represents f(x) and has the minimal number of nodes among all decision trees representing f(x).

Lemma 6 Let x̃l ∈ ti and x̃m ∈ tj. If |ti| > |tj| then If(l) < If(m), and if |ti| = |tj| then If(l) = If(m).

Lemma 7 Let X = {x̃i1, . . . , x̃ik} be the set of variables present in the terms of minimal length of some read-once DNF F. For all x̃ ∈ X there exists a minimal sized decision tree for f(x) with splitting variable x̃ at the root.

Proof (Lemma 4): It follows from Lemmata 6 and 7 that the trees produced by the algorithms DTMinTerm and DTInfluence have the same size. Combining this result with the results of Lemmata 1 and 5, the current lemma follows.

    3.2 Boolean Linear Threshold Functions

    In this section we prove the following lemma:


DTCoeff(s, X, ts)
1: if Σ_{xi ∈ X} |ai| ≤ ts or -Σ_{xi ∈ X} |ai| > ts then
2:   The function is constant. s is a leaf.
3: else
4:   Choose a variable xi from X having the largest |ai|.
5:   Run DTCoeff(s1, X - {xi}, ts - ai) and DTCoeff(s0, X - {xi}, ts + ai).
6: end if

Fig. 7. The DTCoeff algorithm.

xi | xj | other variables | function value
v  | v  | w1, w2, w3, . . . , wn-2 | t1
-v | v  | w1, w2, w3, . . . , wn-2 | t2
v  | -v | w1, w2, w3, . . . , wn-2 | t3
-v | -v | w1, w2, w3, . . . , wn-2 | t4

Fig. 8. Structure of the truth table Gw from G(i, j).

Lemma 8 For any linear threshold function fa,t(x), the decision tree built by the algorithm DTInfluence has the minimal number of nodes among all decision trees representing fa,t(x).

The proof of Lemma 8 consists of two parts. In the first part of the proof we introduce another algorithm, called DTCoeff (see Fig. 7). Then we prove Lemma 8 for the algorithm DTCoeff. In the second part of the proof we show that the trees built by DTCoeff and DTInfluence have the same size.

The difference between DTCoeff and DTInfluence is in the choice of the splitting variable. DTCoeff chooses the variable with the largest |ai|, and stops when the restricted function becomes constant (true or false). The meaning and initial values of the first two parameters of the algorithm are the same as in DTInfluence, and the third parameter is initially set to the function's threshold t.
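A Python sketch of DTCoeff (ours; variables take values in {-1, 1}, matching the arithmetic encoding of LTFs, and a is given as a dictionary of coefficients):

def dt_coeff(a, X, ts):
    """Build a decision tree for f(x) = [sum_i a_i*x_i > ts] over x in {-1,1}^n,
    splitting on the remaining variable with the largest |a_i| (cf. Fig. 7)."""
    reach = sum(abs(a[i]) for i in X)
    if reach <= ts:                        # even the largest achievable sum is <= ts: f_R = 0
        return ("leaf", 0)
    if -reach > ts:                        # even the smallest achievable sum is > ts: f_R = 1
        return ("leaf", 1)
    i = max(X, key=lambda j: abs(a[j]))
    return ("node", i,
            dt_coeff(a, X - {i}, ts + a[i]),   # 0-son: x_i = -1
            dt_coeff(a, X - {i}, ts - a[i]))   # 1-son: x_i = +1

a = {1: 3, 2: -2, 3: 1}                    # f(x) = [3*x1 - 2*x2 + x3 > 2]
print(dt_coeff(a, set(a), 2))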

Lemma 9 Given the coefficient vector a, the decision tree T built by the algorithm DTCoeff represents fa,t(x) and has the minimal number of nodes among all decision trees representing fa,t(x).

Proof Appears in the online full version ([9]).

We now prove a sequence of lemmata connecting the influence and the coefficients of variables in the threshold formula. Let xi and xj be two different variables in f(x). For each of the 2^{n-2} possible assignments to the remaining variables we get a 4-row truth table for the different values of xi and xj. Let G(i, j) be the multiset of 2^{n-2} truth tables, indexed by the assignment to the other variables. I.e., Gw is the truth table where the other variables are assigned the values w = w1, w2, . . . , wn-2. The structure of a single truth table is shown in Fig. 8. In this figure, and generally from now on, v and -v are constants in {-1, 1}.

Observe that If(i) is proportional to the sum, over the 2^{n-2} Gw's in G(i, j), of the number of times t1 ≠ t2 plus the number of times t3 ≠ t4. Similarly, If(j) is proportional to the sum over the 2^{n-2} Gw's in G(i, j) of the number of times t1 ≠ t3 plus the number of times t2 ≠ t4. We use these observations in the proof of the following lemma (see online full version [9]):

Lemma 10 If If(i) > If(j) then |ai| > |aj|.

Note that if If(i) = If(j) then there may be any relation between |ai| and |aj|. The next lemma shows that choosing variables with the same influence in any order does not increase the size of the resulting decision tree. For any node s, let Xs be the set of all variables in X which are untested on the path from the root to s. Let X(s) = {x1, . . . , xk} be the variables having the same non-zero influence, which in turn is the largest influence among the influences of the variables in Xs.

Lemma 11 Let Ti (Tj) be the smallest decision tree one may get when choosing any xi ∈ X(s) (xj ∈ X(s)) at s. Let |Topt| be the size of the smallest tree rooted at s. Then |Ti| = |Tj| = |Topt|.

Proof The proof is by induction on k. For k = 1 the lemma trivially holds. Assume the lemma holds for all k' < k. Next we prove the lemma for k. Consider two attributes xi and xj from X(s) and the possible values of the targets in any truth table Gw ∈ G(i, j). Since the underlying function is a boolean linear threshold function and If(i) = If(j), the targets may have 4 forms:

Type A. All rows in Gw have target value 0.
Type B. All rows in Gw have target value 1.
Type C. The target value f in Gw is defined as f = (aixi > 0 and ajxj > 0).
Type D. The target value f in Gw is defined as f = (aixi > 0 or ajxj > 0).

Consider the smallest tree T testing xi at s. There are 3 cases to be considered:

1. Both sons of xi are leaves. Since If(i) > 0 and If(j) > 0 there is at least one Gw ∈ G(i, j) having a target of type C or D. Thus neither xi nor xj can determine the function, and this case is impossible.
2. Both sons of xi are non-leaves. By the inductive hypothesis there exist right and left smallest subtrees of xi, each one rooted with xj. Then xi and xj may be interchanged to produce an equivalent decision tree T' testing xj at s and having the same size.
3. Exactly one of the sons of xi is a leaf.

Let us consider the third case. By the inductive hypothesis the non-leaf son of s tests xj. It is not hard to see (see online full version [9]) that in this case G(i, j) contains either truth tables with targets of type A and C, or truth tables with targets of type B and D (otherwise both sons of xi are non-leaves). In both these cases some value of xj determines the value of the function. Therefore if we place the test "xj = 1?" at s, then exactly one of its sons is a leaf. Thus it can be easily verified that testing xj and then xi, or testing xi and then xj, results in a tree of the same size (see [9]).


DTExactPG(s, X, R, φ)
1: if all examples arriving at s have the same classification then
2:   Set s as a leaf with that value.
3: else
4:   Choose xi = arg max_{xi ∈ X} {PG(fR, xi, φ)} to be a splitting variable.
5:   Run DTExactPG(s1, X - {xi}, R ∪ {xi = 1}, φ).
6:   Run DTExactPG(s0, X - {xi}, R ∪ {xi = 0}, φ).
7: end if

Fig. 9. The DTExactPG algorithm.

Proof (Lemma 8) Combining Lemmata 9, 10 and 11 we obtain that there exists a smallest decision tree having the same splitting rule as that of DTInfluence. Combining this result with Lemma 1 concludes the proof.

    4 Optimality of Exact Purity Gain

In this section we introduce a new algorithm for building decision trees, DTExactPG (see Fig. 9), using exact values of the purity gain. The proofs presented in this section are independent of the specific form of the impurity function and thus are valid for all impurity functions satisfying the conditions defined in Section 2.2.

    The next lemma follows directly from the definition of the algorithm:

Lemma 12 Let f(x) be any boolean function. Then the decision tree T built by the algorithm DTExactPG represents f(x) and there exists no inner node such that all inputs arriving at it have the same classification.

Lemma 13 For any boolean function f(x), uniformly distributed x, and any node s, p(s0(i)) and p(s1(i)) are symmetric relative to p(s): |p(s1(i)) - p(s)| = |p(s0(i)) - p(s)| and p(s1(i)) + p(s0(i)) = 2p(s).

Proof Appears in the full version of the paper ([9]).

Lemma 14 For any unate boolean function f(x), uniformly distributed input x, and any impurity function φ, If(i) > If(j) ⇔ PG(f, xi, φ) > PG(f, xj, φ).

Proof Since x is distributed uniformly, it is sufficient to prove that If(i) > If(j) ⇔ BIS(f, xi, φ) < BIS(f, xj, φ). Let di be the number of pairs of examples differing only in xi and having different target values. Since all examples have equal probability, If(i) = di/2^{n-1}.

Consider a split of the node s according to xi. All positive examples arriving at s may be divided into two categories:

1. Flipping the value of the i-th attribute does not change the target value of the example. Half of such positive examples pass to s1 and the other half pass to s0. Consequently such positive examples contribute equally to the probabilities of positive examples in s1 and s0.


2. Flipping the value of the i-th attribute changes the target value of the example. Consider such a pair of positive and negative examples, differing only in xi. Since f(x) is unate, either all positive examples in such pairs have xi = 1 and all negative examples in such pairs have xi = 0, or all positive examples in such pairs have xi = 0 and all negative examples in such pairs have xi = 1. Consequently either all such positive examples pass to s1 or all such positive examples pass to s0. Thus such examples increase the probability of positive examples in one of the nodes {s1, s0} and decrease the probability of positive examples in the other.

Observe that the number of positive examples in the second category is essentially di. Thus If(i) > If(j) ⇔ max{p(s1(i)), p(s0(i))} > max{p(s1(j)), p(s0(j))}. By Lemma 13, for all i, p(s1(i)) and p(s0(i)) are symmetric relative to p(s). Therefore, if max{p(s1(i)), p(s0(i))} > max{p(s1(j)), p(s0(j))} then the probabilities of xi are more distant from p(s) than those of xj. Consequently, due to the concavity of the impurity function, BIS(f, xj, φ) > BIS(f, xi, φ).

Proof Sketch (of Theorem 1) The first part of the theorem follows from Lemmata 14, 12 and 2. The second part of the theorem follows from Lemmata 14, 6, 7, 11, 4, 8, 3 and 12. See the online full version [9] for a complete proof.

    5 Optimality of Approximate Purity Gain

The purity gain computed by practical algorithms is not exact. However, under some conditions the approximate purity gain suffices. The proof of this result is based on the following lemma (proved in the online full version [9]):

Lemma 15 Let f(x) be a boolean function which can be represented by a decision tree of depth h, and let x be distributed uniformly. Then Pr(f(x) = 1) = r/2^h, with r ∈ ℤ, 0 ≤ r ≤ 2^h.
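As an illustration (using the example function of Fig. 3): f(x) = x1x3 ∨ x̄1x2 is representable by a decision tree of depth h = 2, and indeed Pr(f(x) = 1) = 4/8 = 2/2^2, i.e. r = 2.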

Proof Sketch (Theorem 2) From Lemma 15 and Theorem 1, to obtain the equivalence of exact and approximate purity gains we need to compute all probabilities within accuracy at least 1/2^{2h} (h is the minimal height of a decision tree representing the function). We show that accuracy poly(1/2^h) suffices for the equivalence. See the online full version ([9]) for the complete proof.

    6 Noise-Tolerant Probably Correct Learning

In this section we assume that each input example is misclassified with probability (noise rate) η < 0.5. We introduce a reformulation of the practical impurity-based algorithms in terms of statistical queries. Since our noise-free algorithms learn probably correctly, we would like to obtain the same results of probable correctness with noisy examples. Our definition of PC learning with noise is that the examples are noisy yet, nonetheless, we insist upon zero generalization error. Previous learning algorithms with noise (e.g. [15]) allow a non-zero generalization error.


DTStatQuery(s, X, R, φ, h)
1: if P̂r[fR = 1](1/2^{2h}) > 1 - 1/2^{2h} then
2:   Set s as a positive leaf.
3: else
4:   if P̂r[fR = 1](1/2^{2h}) < 1/2^{2h} then
5:     Set s as a negative leaf.
6:   else
7:     Choose xi = arg max_{xi ∈ X} P̂G(fR, xi, φ, 1/2^{4h}) to be a splitting variable.
8:     Run DTStatQuery(s1, X - {xi}, R ∪ {xi = 1}, φ, h).
9:     Run DTStatQuery(s0, X - {xi}, R ∪ {xi = 0}, φ, h).
10:  end if
11: end if

Fig. 10. The DTStatQuery algorithm.

Let P̂r[fR = 1](τ) be the estimate of Pr[fR = 1] within accuracy τ. The algorithm DTStatQuery, which is a reformulation of DTApproxPG in terms of statistical queries, is shown in Figure 10.

Lemma 16 Let f(x) be a unate boolean function. Then, for any impurity function, DTStatQuery builds a minimal height decision tree representing f(x). If f(x) is a read-once DNF or a boolean linear threshold function then the resulting tree also has the minimal number of nodes amongst all decision trees representing f(x).

    Proof Follows from Lemma 15 and Theorem 2. See full version of the paper([9]) for a complete proof.

Kearns shows in [15] how to simulate statistical queries from examples corrupted by small classification noise. This simulation involves the estimation of the noise rate η. [15] shows that if the statistical queries need to be computed within accuracy τ, then η should be estimated within accuracy Δ/2 = Θ(τ). Such an estimation may be obtained by taking 1/(2Δ) estimates of η of the form iΔ. Running the learning algorithm with each estimate in turn, we obtain 1/(2Δ) hypotheses h1, . . . , h_{1/(2Δ)}. By the definition of Δ, amongst these hypotheses there exists at least one hypothesis hj having the same generalization error as the statistical query algorithm. Then [15] describes a procedure for recognizing a hypothesis having generalization error of at most ε. The naive approach to recognizing the minimal sized decision tree having zero generalization error amongst h1, . . . , h_{1/(2Δ)} is to apply the procedure of [15] with ε = 1/(2 · 2^n). However, in this case this procedure requires about 2^n noisy examples. Next we show how to recognize a minimal size decision tree with zero generalization error using only poly(2^h) uniformly distributed noisy examples.

Let γi = Pr_{EX_η(U)}[hi(x) ≠ f(x)] be the generalization error of hi over the space of noisy examples. Clearly, γi ≥ η for all i, and γj = η.


Moreover, among the 1/(2Δ) estimates ηi = iΔ of η (i = 0, . . . , 1/(2Δ) - 1) there exists ηi = ηj such that |η - ηj| ≤ Δ/2. Therefore our current goal is to find such a j. Let γ̂i be the estimate of γi within accuracy Δ/4. If |i1 - i2| > 1 then we get that j = min{i1, i2}. If |i1 - i2| = 1 and |γ̂i1 - γ̂i2| ≥ Δ/2 then, since the accuracy of γ̂ is Δ/4, j = min{i1, i2}. The final subcase to be considered is |γ̂i1 - γ̂i2|